    AnglE-optimized Text Embeddings. (arXiv:2309.12871v6 [cs.CL] UPDATED)
    High-quality text embedding is pivotal in improving semantic textual similarity (STS) tasks, which are crucial components in Large Language Model (LLM) applications. However, a common challenge existing text embedding models face is the problem of vanishing gradients, primarily due to their reliance on the cosine function in the optimization objective, which has saturation zones. To address this issue, this paper proposes a novel angle-optimized text embedding model called AnglE. The core idea of AnglE is to introduce angle optimization in a complex space. This novel approach effectively mitigates the adverse effects of the saturation zone in the cosine function, which can impede gradient and hinder optimization processes. To set up a comprehensive STS evaluation, we experimented on existing short-text STS datasets and a newly collected long-text STS dataset from GitHub Issues. Furthermore, we examine domain-specific STS scenarios with limited labeled data and explore how AnglE works with LLM-annotated data. Extensive experiments were conducted on various tasks including short-text STS, long-text STS, and domain-specific STS tasks. The results show that AnglE outperforms the state-of-the-art (SOTA) STS models that ignore the cosine saturation zone. These findings demonstrate the ability of AnglE to generate high-quality text embeddings and the usefulness of angle optimization in STS.  ( 2 min )
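    The saturation argument above can be checked numerically. The sketch below is not AnglE's objective; it only illustrates, under the assumption that the loss depends on the angle through cos(theta), why gradients shrink near the ends of the cosine range. The function name is hypothetical.

```python
import numpy as np

def cosine_grad_magnitude(theta):
    """Magnitude of d cos(theta) / d theta, which equals |sin(theta)|."""
    return np.abs(np.sin(theta))

# Angles near 0 and pi lie in the saturation zones of the cosine;
# pi / 2 is where the cosine is most sensitive to the angle.
angles = np.array([0.01, np.pi / 2, np.pi - 0.01])
grads = cosine_grad_magnitude(angles)

for theta, g in zip(angles, grads):
    print(f"theta={theta:.3f}  |d cos / d theta|={g:.4f}")
```

    Pairs whose embeddings are already almost parallel or almost opposite therefore contribute very little gradient, which is the effect the abstract attributes to the cosine's saturation zones.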
    The Geometric Structure of Fully-Connected ReLU Layers. (arXiv:2310.03482v2 [cs.LG] UPDATED)
    We formalize and interpret the geometric structure of $d$-dimensional fully connected ReLU layers in neural networks. The parameters of a ReLU layer induce a natural partition of the input domain, such that the ReLU layer can be significantly simplified in each sector of the partition. This leads to a geometric interpretation of a ReLU layer as a projection onto a polyhedral cone followed by an affine transformation, in line with the description in [doi:10.48550/arXiv.1905.08922] for convolutional networks with ReLU activations. Further, this structure facilitates simplified expressions for preimages of the intersection between partition sectors and hyperplanes, which is useful when describing decision boundaries in a classification setting. We investigate this in detail for a feed-forward network with one hidden ReLU-layer, where we provide results on the geometric complexity of the decision boundary generated by such networks, as well as proving that modulo an affine transformation, such a network can only generate $d$ different decision boundaries. Finally, the effect of adding more layers to the network is discussed.  ( 2 min )
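    The partition described above can be made concrete with a small NumPy sketch. It assumes a square layer x -> ReLU(Wx + b) with random weights; the helper names are illustrative and do not come from the paper. Within the sector fixed by the sign pattern of Wx + b, the layer reduces to the affine map D W x + D b, where D is the diagonal 0/1 matrix of active units.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
W = rng.normal(size=(d, d))
b = rng.normal(size=d)

def relu_layer(x):
    """Standard fully connected ReLU layer."""
    return np.maximum(W @ x + b, 0.0)

def activation_pattern(x):
    """0/1 vector identifying the sector of the partition that contains x."""
    return (W @ x + b > 0).astype(float)

def affine_in_sector(x, pattern):
    """Affine map that coincides with the ReLU layer inside one sector."""
    D = np.diag(pattern)
    return D @ W @ x + D @ b

x = rng.normal(size=d)
pattern = activation_pattern(x)
print("sector pattern:", pattern)
print("match:", np.allclose(relu_layer(x), affine_in_sector(x, pattern)))
```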
    Federated Foundation Models: Privacy-Preserving and Collaborative Learning for Large Models. (arXiv:2305.11414v2 [cs.LG] UPDATED)
    Foundation Models (FMs), such as LLaMA, BERT, GPT, ViT, and CLIP, have demonstrated remarkable success in a wide range of applications, driven by their ability to leverage vast amounts of data for pre-training. However, optimizing FMs often requires access to sensitive data, raising privacy concerns and limiting their applicability in many domains. In this paper, we propose the Federated Foundation Models (FFMs) paradigm, which combines the benefits of FMs and Federated Learning (FL) to enable privacy-preserving and collaborative learning across multiple end-users. We discuss the potential benefits and challenges of integrating FL into the lifespan of FMs, covering pre-training, fine-tuning, and application. We further outline potential future research avenues in FFM, including FFM pre-training, FFM fine-tuning, and federated prompt tuning, which allow the development of more personalized and context-aware models while ensuring data privacy. Moreover, we explore the possibility of continual/lifelong learning in FFMs, as increased computational power at the edge may unlock the potential for optimizing FMs using newly generated private data close to the data source. The proposed FFM concepts offer a flexible and scalable framework for training large language models in a privacy-preserving manner, setting the stage for subsequent advancements in both FM training and federated learning.  ( 2 min )
    Robust Graph Clustering via Meta Weighting for Noisy Graphs. (arXiv:2311.00322v2 [cs.LG] UPDATED)
    How can we find meaningful clusters in a graph robustly against noise edges? Graph clustering (i.e., dividing nodes into groups of similar ones) is a fundamental problem in graph analysis with applications in various fields. Recent studies have demonstrated that graph neural network (GNN) based approaches yield promising results for graph clustering. However, we observe that their performance degenerates significantly on graphs with noise edges, which are prevalent in practice. In this work, we propose MetaGC for robust GNN-based graph clustering. MetaGC employs a decomposable clustering loss function, which can be rephrased as a sum of losses over node pairs. We add a learnable weight to each node pair, and MetaGC adaptively adjusts the weights of node pairs using meta-weighting so that the weights of meaningful node pairs increase and the weights of less-meaningful ones (e.g., noise edges) decrease. We show empirically that MetaGC learns weights as intended and consequently outperforms the state-of-the-art GNN-based competitors, even when they are equipped with separate denoising schemes, on five real-world graphs under varying levels of noise. Our code and datasets are available at https://github.com/HyeonsooJo/MetaGC.  ( 2 min )
    Chain-of-Thought Reasoning is a Policy Improvement Operator. (arXiv:2309.08589v2 [cs.LG] UPDATED)
    Large language models have astounded the world with fascinating new capabilities. However, they currently lack the ability to teach themselves new skills, relying instead on large amounts of human-generated training data. We introduce SECToR (Self-Education via Chain-of-Thought Reasoning), a proof-of-concept demonstration that language models can teach themselves new skills using chain-of-thought reasoning. During the self-learning loop, SECToR asks models to solve addition problems using chain-of-thought reasoning before training the next version of the model to solve those same problems directly without using such reasoning. This process often results in an improved model which can, when again augmented with chain-of-thought reasoning, solve even harder problems than the original model, allowing the self-learning loop to continue. Language models trained via SECToR autonomously learn to add up to the longest-length-digit numbers without access to any ground truth examples beyond an initial supervised fine-tuning phase consisting only of numbers with 6 or fewer digits. Our central hypothesis is that chain-of-thought reasoning can act as a policy improvement operator, similarly to how Monte-Carlo Tree Search is used in AlphaZero (Silver et al., 2017). We hope that this research can lead to new directions in which language models can learn to teach themselves without the need for human demonstrations.  ( 2 min )
    Stochastic Thermodynamics of Learning Generative Models. (arXiv:2310.19802v3 [cs.LG] UPDATED)
    We have formulated generative machine learning problems as the time evolution of Parametric Probabilistic Models (PPMs), inherently rendering a thermodynamic process. Then, we have studied the thermodynamic exchange between the model's parameters, denoted as $\Theta$, and the model's generated samples, denoted as $X$. We demonstrate that the training dataset and the action of the Stochastic Gradient Descent (SGD) optimizer serve as a work source that governs the time evolution of these two subsystems. Our findings reveal that the model learns through the dissipation of heat during the generation of samples $X$, leading to an increase in the entropy of the model's parameters, $\Theta$. Thus, the parameter subsystem acts as a heat reservoir, effectively storing the learned information. Furthermore, the role of the model's parameters as a heat reservoir provides valuable thermodynamic insights into the generalization power of over-parameterized models. This approach offers an unambiguous framework for computing information-theoretic quantities within deterministic neural networks by establishing connections with thermodynamic variables. To illustrate the utility of this framework, we introduce two information-theoretic metrics: Memorized-information (M-info) and Learned-information (L-info), which trace the dynamic flow of information during the learning process of PPMs.  ( 2 min )
    SEMQA: Semi-Extractive Multi-Source Question Answering. (arXiv:2311.04886v1 [cs.CL])
    Recently proposed long-form question answering (QA) systems, supported by large language models (LLMs), have shown promising capabilities. Yet, attributing and verifying their generated abstractive answers can be difficult, and automatically evaluating their accuracy remains an ongoing challenge. In this work, we introduce a new QA task for answering multi-answer questions by summarizing multiple diverse sources in a semi-extractive fashion. Specifically, Semi-extractive Multi-source QA (SEMQA) requires models to output a comprehensive answer, while mixing factual quoted spans -- copied verbatim from given input sources -- and non-factual free-text connectors that glue these spans together into a single cohesive passage. This setting bridges the gap between the outputs of well-grounded but constrained extractive QA systems and more fluent but harder to attribute fully abstractive answers. Particularly, it enables a new mode for language models that leverages their advanced language generation capabilities, while also producing fine in-line attributions by-design that are easy to verify, interpret, and evaluate. To study this task, we create the first dataset of this kind, QuoteSum, with human-written semi-extractive answers to natural and generated questions, and define text-based evaluation metrics. Experimenting with several LLMs in various settings, we find this task to be surprisingly challenging, demonstrating the importance of QuoteSum for developing and studying such consolidation capabilities.  ( 2 min )
    Quality-Diversity through AI Feedback. (arXiv:2310.13032v3 [cs.CL] UPDATED)
    In many text-generation problems, users may prefer not only a single response, but a diverse range of high-quality outputs from which to choose. Quality-diversity (QD) search algorithms aim at such outcomes, by continually improving and diversifying a population of candidates. However, the applicability of QD to qualitative domains, like creative writing, has been limited by the difficulty of algorithmically specifying measures of quality and diversity. Interestingly, recent developments in language models (LMs) have enabled guiding search through AI feedback, wherein LMs are prompted in natural language to evaluate qualitative aspects of text. Leveraging this development, we introduce Quality-Diversity through AI Feedback (QDAIF), wherein an evolutionary algorithm applies LMs to both generate variation and evaluate the quality and diversity of candidate text. When assessed on creative writing domains, QDAIF covers more of a specified search space with high-quality samples than do non-QD controls. Further, human evaluation of QDAIF-generated creative texts validates reasonable agreement between AI and human evaluation. Our results thus highlight the potential of AI feedback to guide open-ended search for creative and original solutions, providing a recipe that seemingly generalizes to many domains and modalities. In this way, QDAIF is a step towards AI systems that can independently search, diversify, evaluate, and improve, which are among the core skills underlying human society's capacity for innovation.  ( 3 min )
    Solution of FPK Equation for Stochastic Dynamics Subjected to Additive Gaussian Noise via Deep Learning Approach. (arXiv:2311.04511v1 [cs.LG])
    The Fokker-Planck-Kolmogorov (FPK) equation is an idealized model representing many stochastic systems commonly encountered in the analysis of stochastic structures as well as many other applications. Its solution thus provides invaluable insight into the performance of many engineering systems. Despite its great importance, the solution of the FPK equation is still extremely challenging. For systems of practical significance, the FPK equation is usually high dimensional, rendering most numerical methods ineffective. In this respect, the present work introduces the FPK-DP Net as a physics-informed network that encodes the physical insights, i.e. the governing constrained differential equations arising from physical laws, into a deep neural network. FPK-DP Net is a mesh-free learning method that can solve the density evolution of stochastic dynamics subjected to additive white Gaussian noise without any prior simulation data and can be used as an efficient surrogate model afterward. FPK-DP Net uses the dimension-reduced FPK equation. Therefore, it can be used to address high-dimensional practical problems as well. To demonstrate the potential applicability of the proposed framework, and to study its accuracy and efficacy, numerical implementations on five different benchmark problems are investigated.  ( 3 min )
    Deep learning as a tool for quantum error reduction in quantum image processing. (arXiv:2311.04575v1 [quant-ph])
    Despite the limited availability and quantum volume of quantum computers, quantum image representation is a widely researched area. Currently developed methods use quantum entanglement to encode information about pixel positions. These methods range from using the angle parameter of the rotation gate (e.g., the Flexible Representation of Quantum Images, FRQI), sequences of qubits (e.g., Novel Enhanced Quantum Representation, NEQR), or the angle parameter of the phase shift gates (e.g., Local Phase Image Quantum Encoding, LPIQE) for storing color information. All these methods are significantly affected by decoherence and other forms of quantum noise, which is an inseparable part of quantum computing in the noisy intermediate-scale quantum era. These phenomena can highly influence the measurements and result in extracted images that are visually dissimilar to the originals. Because this process is at its foundation quantum, the computational reversal of this process is possible. There are many methods for error correction, mitigation, and reduction, but all of them use quantum computer time or additional qubits to achieve the desired result. We report the successful use of a generative adversarial network trained for image-to-image translation, in conjunction with Phase Distortion Unraveling error reduction method, for reducing overall error in images encoded using LPIQE.  ( 2 min )
    Uncertainty Quantification for Eosinophil Segmentation. (arXiv:2309.16536v2 [eess.IV] UPDATED)
    Eosinophilic Esophagitis (EoE) is an allergic condition increasing in prevalence. To diagnose EoE, pathologists must find 15 or more eosinophils within a single high-power field (400X magnification). Determining whether or not a patient has EoE can be an arduous process, and any medical imaging approach used to assist diagnosis must consider both efficiency and precision. We propose an improvement of Adorno et al.'s approach for quantifying eosinophils using deep image segmentation. Our new approach leverages Monte Carlo Dropout, a common approach in deep learning to reduce overfitting, to provide uncertainty quantification on current deep learning models. The uncertainty can be visualized in an output image to evaluate model performance, provide insight into how deep learning algorithms function, and assist pathologists in identifying eosinophils.  ( 2 min )
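    Monte Carlo Dropout, as mentioned above, keeps dropout active at inference time and treats the spread of repeated stochastic predictions as an uncertainty estimate. The following is a toy NumPy sketch with random weights and hypothetical per-pixel feature vectors; it is not the segmentation model from the paper, and all names and sizes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(8, 16))   # hypothetical hidden-layer weights
W2 = rng.normal(size=(16, 1))   # hypothetical output weights

def stochastic_forward(x, p_drop=0.5):
    """One forward pass with dropout left switched on."""
    h = np.maximum(x @ W1, 0.0)
    mask = rng.random(h.shape) >= p_drop
    h = h * mask / (1.0 - p_drop)           # inverted dropout scaling
    logits = h @ W2
    return 1.0 / (1.0 + np.exp(-logits))    # foreground probability

def mc_dropout(x, n_samples=50):
    """Mean prediction and per-input standard deviation over stochastic passes."""
    samples = np.stack([stochastic_forward(x) for _ in range(n_samples)])
    return samples.mean(axis=0), samples.std(axis=0)

x = rng.normal(size=(5, 8))   # 5 hypothetical pixels, 8 features each
mean, std = mc_dropout(x)
print("mean probability:", mean.ravel())
print("uncertainty (std):", std.ravel())
```

    In a segmentation setting the standard deviation would be computed per pixel and rendered as an image, which is the kind of visualization the abstract refers to.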
    Be Careful When Evaluating Explanations Regarding Ground Truth. (arXiv:2311.04813v1 [cs.CV])
    Evaluating explanations of image classifiers regarding ground truth, e.g. segmentation masks defined by human perception, primarily evaluates the quality of the models under consideration rather than the explanation methods themselves. Driven by this observation, we propose a framework for jointly evaluating the robustness of safety-critical systems that combine a deep neural network with an explanation method. These are increasingly used in real-world applications like medical image analysis or robotics. We introduce a fine-tuning procedure to (mis)align model–explanation pipelines with ground truth and use it to quantify the potential discrepancy between worst and best-case scenarios of human alignment. Experiments across various model architectures and post-hoc local interpretation methods provide insights into the robustness of vision transformers and the overall vulnerability of such AI systems to potential adversarial attacks.  ( 2 min )
    Learning to Select SAT Encodings for Pseudo-Boolean and Linear Integer Constraints. (arXiv:2307.09342v2 [cs.AI] UPDATED)
    Many constraint satisfaction and optimisation problems can be solved effectively by encoding them as instances of the Boolean Satisfiability problem (SAT). However, even the simplest types of constraints have many encodings in the literature with widely varying performance, and the problem of selecting suitable encodings for a given problem instance is not trivial. We explore the problem of selecting encodings for pseudo-Boolean and linear constraints using a supervised machine learning approach. We show that it is possible to select encodings effectively using a standard set of features for constraint problems; however we obtain better performance with a new set of features specifically designed for the pseudo-Boolean and linear constraints. In fact, we achieve good results when selecting encodings for unseen problem classes. Our results compare favourably to AutoFolio when using the same feature set. We discuss the relative importance of instance features to the task of selecting the best encodings, and compare several variations of the machine learning method.  ( 2 min )
    FetMRQC: an open-source machine learning framework for multi-centric fetal brain MRI quality control. (arXiv:2311.04780v1 [eess.IV])
    Fetal brain MRI is becoming an increasingly relevant complement to neurosonography for perinatal diagnosis, allowing fundamental insights into fetal brain development throughout gestation. However, uncontrolled fetal motion and heterogeneity in acquisition protocols lead to data of variable quality, potentially biasing the outcome of subsequent studies. We present FetMRQC, an open-source machine-learning framework for automated image quality assessment and quality control that is robust to domain shifts induced by the heterogeneity of clinical data. FetMRQC extracts an ensemble of quality metrics from unprocessed anatomical MRI and combines them to predict experts' ratings using random forests. We validate our framework on a pioneeringly large and diverse dataset of more than 1600 manually rated fetal brain T2-weighted images from four clinical centers and 13 different scanners. Our study shows that FetMRQC's predictions generalize well to unseen data while being interpretable. FetMRQC is a step towards more robust fetal brain neuroimaging, which has the potential to shed new insights on the developing human brain.  ( 3 min )
    Hierarchically Gated Recurrent Neural Network for Sequence Modeling. (arXiv:2311.04823v1 [cs.CL])
    Transformers have surpassed RNNs in popularity due to their superior abilities in parallel training and long-term dependency modeling. Recently, there has been a renewed interest in using linear RNNs for efficient sequence modeling. These linear RNNs often employ gating mechanisms in the output of the linear recurrence layer while ignoring the significance of using forget gates within the recurrence. In this paper, we propose a gated linear RNN model dubbed Hierarchically Gated Recurrent Neural Network (HGRN), which includes forget gates that are lower bounded by a learnable value. The lower bound increases monotonically when moving up layers. This allows the upper layers to model long-term dependencies and the lower layers to model more local, short-term dependencies. Experiments on language modeling, image classification, and long-range arena benchmarks showcase the efficiency and effectiveness of our proposed model. The source code is available at https://github.com/OpenNLPLab/HGRN.  ( 2 min )
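    A forget gate with a learnable lower bound that grows with depth can be sketched as follows. This is only one plausible construction, written under stated assumptions: the bounds are taken as an exclusive cumulative sum of a softmax over layers, the gate is f = bound + (1 - bound) * sigmoid(logits), and the recurrence is h_t = f_t * h_{t-1} + (1 - f_t) * x_t. Function names are hypothetical; the linked repository is the authoritative implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def layer_lower_bounds(raw):
    """Map raw parameters of shape (num_layers, dim) to bounds that increase with depth."""
    e = np.exp(raw - raw.max(axis=0, keepdims=True))
    probs = e / e.sum(axis=0, keepdims=True)
    # Exclusive cumulative sum: the lowest layer gets 0, upper layers get larger bounds.
    return np.cumsum(probs, axis=0) - probs

def gated_linear_recurrence(inputs, gate_logits, lower_bound):
    """Element-wise linear recurrence with a lower-bounded forget gate."""
    f = lower_bound + (1.0 - lower_bound) * sigmoid(gate_logits)
    h = np.zeros(inputs.shape[1])
    outputs = []
    for t in range(inputs.shape[0]):
        h = f[t] * h + (1.0 - f[t]) * inputs[t]
        outputs.append(h)
    return np.stack(outputs), f

rng = np.random.default_rng(0)
bounds = layer_lower_bounds(rng.normal(size=(3, 4)))   # 3 layers, 4 channels
seq = rng.normal(size=(10, 4))                         # 10 time steps
logits = rng.normal(size=(10, 4))
hidden, forget = gated_linear_recurrence(seq, logits, bounds[2])
print("per-layer bounds:\n", bounds)
```

    A larger lower bound forces the gate closer to 1, so the state decays more slowly; this is how the upper layers would retain longer-range information in this sketch.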
    Explainable AI for Earth Observation: Current Methods, Open Challenges, and Opportunities. (arXiv:2311.04491v1 [cs.AI])
    Deep learning has taken by storm all fields involved in data analysis, including remote sensing for Earth observation. However, despite significant advances in terms of performance, its lack of explainability and interpretability, inherent to neural networks in general since their inception, remains a major source of criticism. Hence it comes as no surprise that the expansion of deep learning methods in remote sensing is being accompanied by increasingly intensive efforts oriented towards addressing this drawback through the exploration of a wide spectrum of Explainable Artificial Intelligence techniques. This chapter, organized according to prominent Earth observation application fields, presents a panorama of the state-of-the-art in explainable remote sensing image analysis.  ( 2 min )
    Distributed Agent-Based Collaborative Learning in Cross-Individual Wearable Sensor-Based Human Activity Recognition. (arXiv:2311.04236v1 [eess.SP])
    The rapid growth of wearable sensor technologies holds substantial promise for the field of personalized and context-aware Human Activity Recognition. Given the inherently decentralized nature of data sources within this domain, the utilization of multi-agent systems with their inherent decentralization capabilities presents an opportunity to facilitate the development of scalable, adaptable, and privacy-conscious methodologies. This paper introduces a collaborative distributed learning approach rooted in multi-agent principles, wherein individual users of sensor-equipped devices function as agents within a distributed network, collectively contributing to the comprehensive process of learning and classifying human activities. In this proposed methodology, not only is the privacy of activity monitoring data upheld for each individual, eliminating the need for an external server to oversee the learning process, but the system also exhibits the potential to surmount the limitations of conventional centralized models and adapt to the unique attributes of each user. The proposed approach has been empirically tested on two publicly accessible human activity recognition datasets, specifically PAMAP2 and HARTH, across varying settings. The provided empirical results conclusively highlight the efficacy of inter-individual collaborative learning when contrasted with centralized configurations, both in terms of local and global generalization.
    Online Learning Quantum States with the Logarithmic Loss via VB-FTRL. (arXiv:2311.04237v1 [quant-ph])
    Online learning quantum states with the logarithmic loss (LL-OLQS) is a quantum generalization of online portfolio selection, a classic open problem in the field of online learning for over three decades. The problem also emerges in designing randomized optimization algorithms for maximum-likelihood quantum state tomography. Recently, Jezequel et al. (arXiv:2209.13932) proposed the VB-FTRL algorithm, the first nearly regret-optimal algorithm for OPS with moderate computational complexity. In this note, we generalize VB-FTRL for LL-OLQS. Let $d$ denote the dimension and $T$ the number of rounds. The generalized algorithm achieves a regret rate of $O ( d^2 \log ( d + T ) )$ for LL-OLQS. Each iteration of the algorithm consists of solving a semidefinite program that can be implemented in polynomial time by, e.g., cutting-plane methods. For comparison, the best-known regret rate for LL-OLQS is currently $O ( d^2 \log T )$, achieved by the exponential weight method. However, there is no explicit implementation available for the exponential weight method for LL-OLQS. To facilitate the generalization, we introduce the notion of VB-convexity. VB-convexity is a sufficient condition for the logarithmic barrier associated with any function to be convex and is of independent interest.
    Solving Kernel Ridge Regression with Gradient-Based Optimization Methods. (arXiv:2306.16838v3 [stat.ML] UPDATED)
    Kernel ridge regression, KRR, is a generalization of linear ridge regression that is non-linear in the data, but linear in the parameters. Here, we introduce an equivalent formulation of the objective function of KRR, opening up both for using penalties other than the ridge penalty and for studying kernel ridge regression from the perspective of gradient descent. Using a continuous-time perspective, we derive a closed-form solution for solving kernel regression with gradient descent, something we refer to as kernel gradient flow, KGF, and theoretically bound the differences between KRR and KGF, where, for the latter, regularization is obtained through early stopping. We also generalize KRR by replacing the ridge penalty with the $\ell_1$ and $\ell_\infty$ penalties, respectively, and use the fact that analogous to the similarities between KGF and KRR, $\ell_1$ regularization and forward stagewise regression (also known as coordinate descent), and $\ell_\infty$ regularization and sign gradient descent, follow similar solution paths. We can thus alleviate the need for computationally heavy algorithms based on proximal gradient descent. We show theoretically and empirically how the $\ell_1$ and $\ell_\infty$ penalties, and the corresponding gradient-based optimization algorithms, produce sparse and robust kernel regression solutions, respectively.
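    The relation between ridge regularization and early stopping can be illustrated on a toy problem. The sketch assumes the flow d(alpha)/dt = y - K alpha started at alpha(0) = 0, whose solution is alpha(t) = K^{-1} (I - exp(-tK)) y; the paper's exact parameterization of KGF may differ. The Gaussian kernel, bandwidth, data, and the pairing t = 1 / lambda are all illustrative choices.

```python
import numpy as np

def rbf_kernel(x, sigma=0.5):
    """Gaussian kernel matrix for 1-D inputs."""
    diff = x[:, None] - x[None, :]
    return np.exp(-diff**2 / (2.0 * sigma**2))

x = np.linspace(0.0, 3.0, 6)
y = np.sin(2.0 * x)
K = rbf_kernel(x)
s, V = np.linalg.eigh(K)   # K = V diag(s) V^T

def krr_coefficients(lam):
    """Kernel ridge regression: alpha = (K + lam I)^{-1} y."""
    return V @ ((V.T @ y) / (s + lam))

def kgf_coefficients(t):
    """Gradient flow stopped at time t: alpha = K^{-1} (I - exp(-t K)) y."""
    return V @ ((-np.expm1(-t * s) / s) * (V.T @ y))

lam = 0.1
alpha_krr = krr_coefficients(lam)
alpha_kgf = kgf_coefficients(1.0 / lam)
alpha_long = kgf_coefficients(200.0)
print("KRR fit:", K @ alpha_krr)
print("KGF fit:", K @ alpha_kgf)
```

    Both estimators shrink the components of y along eigenvectors of K with small eigenvalues; a long stopping time plays the role of a small ridge penalty, and without stopping the flow interpolates the training targets.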
    More PAC-Bayes bounds: From bounded losses, to losses with general tail behaviors, to anytime-validity. (arXiv:2306.12214v2 [stat.ML] UPDATED)
    In this paper, we present new high-probability PAC-Bayes bounds for different types of losses. Firstly, for losses with a bounded range, we recover a strengthened version of Catoni's bound that holds uniformly for all parameter values. This leads to new fast rate and mixed rate bounds that are interpretable and tighter than previous bounds in the literature. In particular, the fast rate bound is equivalent to the Seeger--Langford bound. Secondly, for losses with more general tail behaviors, we introduce two new parameter-free bounds: a PAC-Bayes Chernoff analogue when the loss' cumulative generating function is bounded, and a bound when the loss' second moment is bounded. These two bounds are obtained using a new technique based on a discretization of the space of possible events for the "in probability" parameter optimization problem. This technique is both simpler and more general than previous approaches optimizing over a grid on the parameters' space. Finally, we extend all previous results to anytime-valid bounds using a simple technique applicable to any existing bound.
    Spuriosity Didn't Kill the Classifier: Using Invariant Predictions to Harness Spurious Features. (arXiv:2307.09933v2 [cs.LG] UPDATED)
    To avoid failures on out-of-distribution data, recent works have sought to extract features that have an invariant or stable relationship with the label across domains, discarding "spurious" or unstable features whose relationship with the label changes across domains. However, unstable features often carry complementary information that could boost performance if used correctly in the test domain. In this work, we show how this can be done without test-domain labels. In particular, we prove that pseudo-labels based on stable features provide sufficient guidance for doing so, provided that stable and unstable features are conditionally independent given the label. Based on this theoretical insight, we propose Stable Feature Boosting (SFB), an algorithm for: (i) learning a predictor that separates stable and conditionally-independent unstable features; and (ii) using the stable-feature predictions to adapt the unstable-feature predictions in the test domain. Theoretically, we prove that SFB can learn an asymptotically-optimal predictor without test-domain labels. Empirically, we demonstrate the effectiveness of SFB on real and synthetic data.
    Efficient Beam Tree Recursion. (arXiv:2307.10779v2 [cs.LG] UPDATED)
    Beam Tree Recursive Neural Network (BT-RvNN) was recently proposed as a simple extension of Gumbel Tree RvNN and it was shown to achieve state-of-the-art length generalization performance in ListOps while maintaining comparable performance on other tasks. However, although not the worst of its kind, BT-RvNN can still be exorbitantly expensive in memory usage. In this paper, we identify the main bottleneck in BT-RvNN's memory usage to be the entanglement of the scorer function and the recursive cell function. We propose strategies to remove this bottleneck and further simplify its memory usage. Overall, our strategies not only reduce the memory usage of BT-RvNN by $10$-$16$ times but also create a new state-of-the-art in ListOps while maintaining similar performance in other tasks. In addition, we also propose a strategy to utilize the induced latent-tree node representations produced by BT-RvNN to turn BT-RvNN from a sentence encoder of the form $f:\mathbb{R}^{n \times d} \rightarrow \mathbb{R}^{d}$ into a sequence contextualizer of the form $f:\mathbb{R}^{n \times d} \rightarrow \mathbb{R}^{n \times d}$. Thus, our proposals not only open up a path for further scalability of RvNNs but also standardize a way to use BT-RvNNs as another building block in the deep learning toolkit that can be easily stacked or interfaced with other popular models such as Transformers and Structured State Space models.
    Decentralized Personalized Online Federated Learning. (arXiv:2311.04817v1 [cs.LG])
    Vanilla federated learning does not support learning in an online environment, learning a personalized model on each client, or learning in a decentralized setting. There are existing methods extending federated learning in each of the three aspects. However, some important applications on enterprise edge servers (e.g. online item recommendation at global scale) involve the three aspects at the same time. Therefore, we propose a new learning setting, Decentralized Personalized Online Federated Learning, that considers all three aspects at the same time. In this new setting for learning, the first technical challenge is how to aggregate the shared model parameters from neighboring clients to obtain a personalized local model with good performance on each client. We propose to directly learn an aggregation by optimizing the performance of the local model with respect to the aggregation weights. This not only improves personalization of each local model but also helps the local model adapt to potential data shift by intelligently incorporating the right amount of information from its neighbors. The second challenge is how to select the neighbors for each client. We propose a peer selection method based on the learned aggregation weights, enabling each client to select the most helpful neighbors and reduce communication cost at the same time. We verify the effectiveness and robustness of our proposed method on three real-world item recommendation datasets and one air quality prediction dataset.
    Algorithms for Non-Negative Matrix Factorization on Noisy Data With Negative Values. (arXiv:2311.04855v1 [astro-ph.IM])
    Non-negative matrix factorization (NMF) is a dimensionality reduction technique that has shown promise for analyzing noisy data, especially astronomical data. For these datasets, the observed data may contain negative values due to noise even when the true underlying physical signal is strictly positive. Prior NMF work has not treated negative data in a statistically consistent manner, which becomes problematic for low signal-to-noise data with many negative values. In this paper we present two algorithms, Shift-NMF and Nearly-NMF, that can handle both the noisiness of the input data and also any introduced negativity. Both of these algorithms use the negative data space without clipping, and correctly recover non-negative signals without any introduced positive offset that occurs when clipping negative data. We demonstrate this numerically on both simple and more realistic examples, and prove that both algorithms have monotonically decreasing update rules.
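    The positive offset introduced by clipping, mentioned above, is easy to reproduce. The sketch below is not Shift-NMF or Nearly-NMF; it only shows, for an assumed true signal of zero with unit-variance Gaussian noise, that clipping negative values biases the mean upward.

```python
import numpy as np

rng = np.random.default_rng(0)

true_signal = 0.0                                   # assumed non-negative signal
noisy = true_signal + rng.normal(0.0, 1.0, size=100_000)
clipped = np.clip(noisy, 0.0, None)                 # discard negative values

print("mean of noisy data:  ", noisy.mean())
print("mean of clipped data:", clipped.mean())
```

    For standard Gaussian noise the clipped mean approaches 1 / sqrt(2 pi), roughly 0.4, even though the underlying signal is zero, so any factorization fitted to the clipped data inherits that offset.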
    Computing with Residue Numbers in High-Dimensional Representation. (arXiv:2311.04872v1 [cs.NE])
    We introduce Residue Hyperdimensional Computing, a computing framework that unifies residue number systems with an algebra defined over random, high-dimensional vectors. We show how residue numbers can be represented as high-dimensional vectors in a manner that allows algebraic operations to be performed with component-wise, parallelizable operations on the vector elements. The resulting framework, when combined with an efficient method for factorizing high-dimensional vectors, can represent and operate on numerical values over a large dynamic range using vastly fewer resources than previous methods, and it exhibits impressive robustness to noise. We demonstrate the potential for this framework to solve computationally difficult problems in visual perception and combinatorial optimization, showing improvement over baseline methods. More broadly, the framework provides a possible account for the computational operations of grid cells in the brain, and it suggests new machine learning architectures for representing and manipulating numerical data.
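    One way to picture residue arithmetic through component-wise vector operations is sketched below. It assumes an encoding in which each vector component is a random root of unity raised to the encoded integer; the abstract does not spell out the encoding, so this choice and the names `make_base`, `encode`, `bind`, and `similarity` are hypothetical.

```python
import cmath
import random


def make_base(modulus, dim, seed=0):
    """Random vector whose components are modulus-th roots of unity (assumed encoding)."""
    rng = random.Random(seed)
    return [cmath.exp(2j * cmath.pi * rng.randrange(modulus) / modulus)
            for _ in range(dim)]


def encode(x, base):
    """Encode integer x; since each component z satisfies z**modulus == 1, only x mod modulus matters."""
    return [z ** x for z in base]


def bind(a, b):
    """Component-wise product; for these encodings it adds the encoded residues."""
    return [p * q for p, q in zip(a, b)]


def similarity(a, b):
    """Normalized real inner product: near 1 for equal residues, near 0 otherwise."""
    return sum((p.conjugate() * q).real for p, q in zip(a, b)) / len(a)


MODULUS = 7
base = make_base(MODULUS, dim=2000)
total = bind(encode(3, base), encode(5, base))  # (3 + 5) mod 7 = 1
print(similarity(total, encode(1, base)))  # close to 1
print(similarity(total, encode(2, base)))  # close to 0
```

    Every operation acts independently on each component, so the addition parallelizes trivially across the vector.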
    LoopTune: Optimizing Tensor Computations with Reinforcement Learning. (arXiv:2309.01825v3 [cs.LG] UPDATED)
    Advanced compiler technology is crucial for enabling machine learning applications to run on novel hardware, but traditional compilers fail to deliver performance, popular auto-tuners have long search times and expert-optimized libraries introduce unsustainable costs. To address this, we developed LoopTune, a deep reinforcement learning compiler that optimizes tensor computations in deep learning models for the CPU. LoopTune optimizes tensor traversal order while using the ultra-fast lightweight code generator LoopNest to perform hardware-specific optimizations. With a novel graph-based representation and action space, LoopTune speeds up LoopNest by 3.2x, generating an order of magnitude faster code than TVM, 2.8x faster than MetaSchedule, and 1.08x faster than AutoTVM, consistently performing at the level of the hand-tuned library Numpy. Moreover, LoopTune tunes code in order of seconds.
    Dissecting Chain-of-Thought: Compositionality through In-Context Filtering and Learning. (arXiv:2305.18869v2 [cs.LG] UPDATED)
    Chain-of-thought (CoT) is a method that enables language models to handle complex reasoning tasks by decomposing them into simpler steps. Despite its success, the underlying mechanics of CoT are not yet fully understood. In an attempt to shed light on this, our study investigates the impact of CoT on the ability of transformers to in-context learn a simple to study, yet general family of compositional functions: multi-layer perceptrons (MLPs). In this setting, we find that the success of CoT can be attributed to breaking down in-context learning of a compositional function into two distinct phases: focusing on and filtering data related to each step of the composition and in-context learning the single-step composition function. Through both experimental and theoretical evidence, we demonstrate how CoT significantly reduces the sample complexity of in-context learning (ICL) and facilitates the learning of complex functions that non-CoT methods struggle with. Furthermore, we illustrate how transformers can transition from vanilla in-context learning to mastering a compositional function with CoT by simply incorporating additional layers that perform the necessary data-filtering for CoT via the attention mechanism. In addition to these test-time benefits, we show CoT helps accelerate pretraining by learning shortcuts to represent complex functions and filtering plays an important role in this process. These findings collectively provide insights into the mechanics of CoT, inviting further investigation of its role in complex reasoning tasks.
    N-Critics: Self-Refinement of Large Language Models with Ensemble of Critics. (arXiv:2310.18679v2 [cs.CL] UPDATED)
    We propose a self-correction mechanism for Large Language Models (LLMs) to mitigate issues such as toxicity and fact hallucination. This method involves refining model outputs through an ensemble of critics and the model's own feedback. Drawing inspiration from human behavior, we explore whether LLMs can emulate the self-correction process observed in humans who often engage in self-reflection and seek input from others to refine their understanding of complex topics. Our approach is model-agnostic and can be applied across various domains to enhance trustworthiness by addressing fairness, bias, and robustness concerns. We consistently observe performance improvements in LLMs for reducing toxicity and correcting factual errors.
    CoCA: Fusing position embedding with Collinear Constrained Attention for fine-tuning free context window extending. (arXiv:2309.08646v2 [cs.LG] UPDATED)
    Self-attention and position embedding are two key modules in Transformer-based LLMs. The potential relationship between them is far from well studied, especially for context window extension. In this paper, we introduce a collinear constrained relationship to fuse RoPE and self-attention, and name it Collinear Constrained Attention (CoCA). We analyze the computational and spatial complexity of CoCA and determine that it adds only minimal overhead compared to the original Transformer-based models. We provide an efficient implementation of CoCA and make it a drop-in replacement for any existing position embedding and attention modules in Transformer-based models. Experiments show that CoCA performs extraordinarily well on context window extension. For instance, a CoCA-based GPT model trained with a context length of 512 can extend the context window up to 8K without perplexity diverging. This indicates more than 16x context window extension without any fine-tuning. Our code is released here: https://github.com/codefuse-ai/Collinear-Constrained-Attention
    Vital Sign Forecasting for Sepsis Patients in ICUs. (arXiv:2311.04770v1 [cs.LG])
    Sepsis and septic shock are critical medical conditions affecting millions globally, with a substantial mortality rate. This paper uses state-of-the-art deep learning (DL) architectures to introduce a multi-step forecasting system to predict vital signs indicative of septic shock progression in Intensive Care Units (ICUs). Our approach utilizes a short window of historical vital sign data to forecast future physiological conditions. We introduce a DL-based vital sign forecasting system that predicts up to 3 hours of future vital signs from 6 hours of past data. We further adopt the DILATE loss function to better capture the shape and temporal dynamics of vital signs, which are critical for clinical decision-making. We compare three DL models, N-BEATS, N-HiTS, and Temporal Fusion Transformer (TFT), using the publicly available eICU Collaborative Research Database (eICU-CRD), highlighting their forecasting capabilities in a critical care setting. We evaluate the performance of our models using mean squared error (MSE) and dynamic time warping (DTW) metrics. Our findings show that while TFT excels in capturing overall trends, N-HiTS is superior in retaining short-term fluctuations within a predefined range. This paper demonstrates the potential of deep learning in transforming the monitoring systems in ICUs, potentially leading to significant improvements in patient care and outcomes by accurately forecasting vital signs to assist healthcare providers in detecting early signs of physiological instability and anticipating septic shock.
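    The two evaluation metrics named above respond differently to timing errors. The sketch below uses a textbook dynamic-programming DTW with absolute-difference cost, not necessarily the variant used in the paper, and the two toy series are invented for illustration.

```python
def mse(pred, target):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic time warping with absolute-difference cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed predecessor alignments.
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


target = [0, 0, 1, 2, 1, 0, 0]   # toy "vital sign" with one peak
shifted = [0, 1, 2, 1, 0, 0, 0]  # same shape, one step early
print(mse(shifted, target))           # penalizes the timing shift
print(dtw_distance(shifted, target))  # 0.0: the shapes align after warping
```

    MSE punishes a forecast that has the right shape at the wrong time, whereas DTW forgives the shift, which is why reporting both gives a fuller picture of forecast quality.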
    Robust Mean Estimation Without Moments for Symmetric Distributions. (arXiv:2302.10844v2 [cs.DS] UPDATED)
    We study the problem of robustly estimating the mean or location parameter without moment assumptions. We show that for a large class of symmetric distributions, the same error as in the Gaussian setting can be achieved efficiently. The distributions we study include products of arbitrary symmetric one-dimensional distributions, such as product Cauchy distributions, as well as elliptical distributions. For product distributions and elliptical distributions with known scatter (covariance) matrix, we show that given an $\varepsilon$-corrupted sample, we can with probability at least $1-\delta$ estimate its location up to error $O(\varepsilon \sqrt{\log(1/\varepsilon)})$ using $\tfrac{d\log(d) + \log(1/\delta)}{\varepsilon^2 \log(1/\varepsilon)}$ samples. This result matches the best-known guarantees for the Gaussian distribution and known SQ lower bounds (up to the $\log(d)$ factor). For elliptical distributions with unknown scatter (covariance) matrix, we propose a sequence of efficient algorithms that approaches this optimal error. Specifically, for every $k \in \mathbb{N}$, we design an estimator using time and samples $\tilde{O}({d^k})$ achieving error $O(\varepsilon^{1-\frac{1}{2k}})$. This matches the error and running time guarantees when assuming certifiably bounded moments of order up to $k$. For unknown covariance, such error bounds of $o(\sqrt{\varepsilon})$ are not even known for (general) sub-Gaussian distributions. Our algorithms are based on a generalization of the well-known filtering technique. We show how this machinery can be combined with Huber-loss-based techniques to work with projections of the noise that behave more nicely than the initial noise. Moreover, we show how SoS proofs can be used to obtain algorithmic guarantees even for distributions without a first moment. We believe that this approach may find other applications in future works.
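    Why moment-free location estimation is possible for symmetric distributions can be illustrated in one dimension. The sketch below is not the paper's filtering or SoS-based algorithm; it only compares the sample median and the sample mean on Cauchy data, with a corruption fraction, outlier value, and sample size chosen arbitrarily for illustration.

```python
import math
import random
import statistics


def corrupted_cauchy_sample(location, n, eps, outlier, seed=0):
    """Draw n Cauchy samples around `location`, then replace an eps-fraction with `outlier`."""
    rng = random.Random(seed)
    # Inverse-CDF sampling of a standard Cauchy: tan(pi * (u - 1/2)).
    clean = [location + math.tan(math.pi * (rng.random() - 0.5)) for _ in range(n)]
    n_bad = int(eps * n)
    return [outlier] * n_bad + clean[n_bad:]


sample = corrupted_cauchy_sample(location=3.0, n=20000, eps=0.05, outlier=1e6)
median_estimate = statistics.median(sample)
mean_estimate = statistics.fmean(sample)
print(f"median {median_estimate:.3f}, mean {mean_estimate:.1f}")
```

    The median stays near the true location despite the heavy tails and the planted outliers, while the mean is dominated by them.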
    Concept-Centric Transformers: Enhancing Model Interpretability through Object-Centric Concept Learning within a Shared Global Workspace. (arXiv:2305.15775v3 [cs.LG] UPDATED)
    Many interpretable AI approaches have been proposed to provide plausible explanations for a model's decision-making. However, configuring an explainable model that effectively communicates among computational modules has received less attention. A recently proposed shared global workspace theory showed that networks of distributed modules can benefit from sharing information with a bottlenecked memory because the communication constraints encourage specialization, compositionality, and synchronization among the modules. Inspired by this, we propose Concept-Centric Transformers, a simple yet effective configuration of the shared global workspace for interpretability, consisting of: i) an object-centric-based memory module for extracting semantic concepts from input features, ii) a cross-attention mechanism between the learned concept and input embeddings, and iii) standard classification and explanation losses to allow human analysts to directly assess an explanation for the model's classification reasoning. We test our approach against other existing concept-based methods on classification tasks for various datasets, including CIFAR100, CUB-200-2011, and ImageNet, and we show that our model not only achieves better classification accuracy than all baselines across all problems but also generates more consistent concept-based explanations of classification output.
    Zero-Shot Anomaly Detection via Batch Normalization. (arXiv:2302.07849v4 [cs.LG] UPDATED)
    Anomaly detection (AD) plays a crucial role in many safety-critical application domains. The challenge of adapting an anomaly detector to drift in the normal data distribution, especially when no training data is available for the "new normal," has led to the development of zero-shot AD techniques. In this paper, we propose a simple yet effective method called Adaptive Centered Representations (ACR) for zero-shot batch-level AD. Our approach trains off-the-shelf deep anomaly detectors (such as deep SVDD) to adapt to a set of inter-related training data distributions in combination with batch normalization, enabling automatic zero-shot generalization for unseen AD tasks. This simple recipe, batch normalization plus meta-training, is a highly effective and versatile tool. Our theoretical results guarantee the zero-shot generalization for unseen AD tasks; our empirical results demonstrate the first zero-shot AD results for tabular data and outperform existing methods in zero-shot anomaly detection and segmentation on image data from specialized domains. Code is at https://github.com/aodongli/zero-shot-ad-via-batch-norm
    Survival Instinct in Offline Reinforcement Learning. (arXiv:2306.03286v2 [cs.LG] UPDATED)
    We present a novel observation about the behavior of offline reinforcement learning (RL) algorithms: on many benchmark datasets, offline RL can produce well-performing and safe policies even when trained with "wrong" reward labels, such as those that are zero everywhere or are negatives of the true rewards. This phenomenon cannot be easily explained by offline RL's return maximization objective. Moreover, it gives offline RL a degree of robustness that is uncharacteristic of its online RL counterparts, which are known to be sensitive to reward design. We demonstrate that this surprising robustness property is attributable to an interplay between the notion of pessimism in offline RL algorithms and certain implicit biases in common data collection practices. As we prove in this work, pessimism endows the agent with a "survival instinct", i.e., an incentive to stay within the data support in the long term, while the limited and biased data coverage further constrains the set of survival policies. Formally, given a reward class -- which may not even contain the true reward -- we identify conditions on the training data distribution that enable offline RL to learn a near-optimal and safe policy from any reward within the class. We argue that the survival instinct should be taken into account when interpreting results from existing offline RL benchmarks and when creating future ones. Our empirical and theoretical results suggest a new paradigm for RL, whereby an agent is nudged to learn a desirable behavior with imperfect reward but purposely biased data coverage.
    Federated Dataset Dictionary Learning for Multi-Source Domain Adaptation. (arXiv:2309.07670v2 [cs.LG] UPDATED)
    In this article, we propose an approach for federated domain adaptation, a setting where distributional shift exists among clients and some have unlabeled data. The proposed framework, FedDaDiL, tackles the resulting challenge through dictionary learning of empirical distributions. In our setting, clients' distributions represent particular domains, and FedDaDiL collectively trains a federated dictionary of empirical distributions. In particular, we build upon the Dataset Dictionary Learning framework by designing collaborative communication protocols and aggregation operations. The chosen protocols keep clients' data private, thus enhancing overall privacy compared to its centralized counterpart. We empirically demonstrate that our approach successfully generates labeled data on the target domain with extensive experiments on (i) Caltech-Office, (ii) TEP, and (iii) CWRU benchmarks. Furthermore, we compare our method to its centralized counterpart and other benchmarks in federated domain adaptation.
    Cross-Silo Federated Learning Across Divergent Domains with Iterative Parameter Alignment. (arXiv:2311.04818v1 [cs.LG])
    Learning from the collective knowledge of data dispersed across private sources can provide neural networks with enhanced generalization capabilities. Federated learning, a method for collaboratively training a machine learning model across remote clients, achieves this by combining client models via the orchestration of a central server. However, current approaches face two critical limitations: i) they struggle to converge when client domains are sufficiently different, and ii) current aggregation techniques produce an identical global model for each client. In this work, we address these issues by reformulating the typical federated learning setup: rather than learning a single global model, we learn N models each optimized for a common objective. To achieve this, we apply a weighted distance minimization to model parameters shared in a peer-to-peer topology. The resulting framework, Iterative Parameter Alignment, applies naturally to the cross-silo setting, and has the following properties: (i) a unique solution for each participant, with the option to globally converge each model in the federation, and (ii) an optional early-stopping mechanism to elicit fairness among peers in collaborative learning settings. These characteristics jointly provide a flexible new framework for iteratively learning from peer models trained on disparate datasets. We find that the technique achieves competitive results on a variety of data partitions compared to state-of-the-art approaches. Further, we show that the method is robust to divergent domains (i.e. disjoint classes across peers) where existing approaches struggle.
    Foundation Models for Generalist Geospatial Artificial Intelligence. (arXiv:2310.18660v2 [cs.CV] UPDATED)
    Significant progress in the development of highly adaptable and reusable Artificial Intelligence (AI) models is expected to have a significant impact on Earth science and remote sensing. Foundation models are pre-trained on large unlabeled datasets through self-supervision, and then fine-tuned for various downstream tasks with small labeled datasets. This paper introduces a first-of-a-kind framework for the efficient pre-training and fine-tuning of foundational models on extensive geospatial data. We have utilized this framework to create Prithvi, a transformer-based geospatial foundational model pre-trained on more than 1TB of multispectral satellite imagery from the Harmonized Landsat-Sentinel 2 (HLS) dataset. Our study demonstrates the efficacy of our framework in successfully fine-tuning Prithvi to a range of Earth observation tasks that have not been tackled by previous work on foundation models involving multi-temporal cloud gap imputation, flood mapping, wildfire scar segmentation, and multi-temporal crop segmentation. Our experiments show that the pre-trained model accelerates the fine-tuning process compared to leveraging randomly initialized weights. In addition, pre-trained Prithvi compares well against the state-of-the-art, e.g., outperforming a conditional GAN model in multi-temporal cloud imputation by up to 5pp (or 5.7%) in the structural similarity index. Finally, due to the limited availability of labeled data in the field of Earth observation, we gradually reduce the quantity of available labeled data for refining the model to evaluate data efficiency and demonstrate that data can be decreased significantly without affecting the model's accuracy. The pre-trained 100 million parameter model and corresponding fine-tuning workflows have been released publicly as open source contributions to the global Earth sciences community through Hugging Face.
    ToddlerBERTa: Exploiting BabyBERTa for Grammar Learning and Language Understanding. (arXiv:2308.16336v2 [cs.CL] UPDATED)
    We present ToddlerBERTa, a BabyBERTa-like language model, exploring its capabilities through five different models with varied hyperparameters. Evaluating on BLiMP, SuperGLUE, MSGS, and a Supplement benchmark from the BabyLM challenge, we find that smaller models can excel in specific tasks, while larger models perform well with substantial data. Despite training on a smaller dataset, ToddlerBERTa demonstrates commendable performance, rivalling the state-of-the-art RoBERTa-base. The model showcases robust language understanding, even with single-sentence pretraining, and competes with baselines that leverage broader contextual information. Our work provides insights into hyperparameter choices, and data utilization, contributing to the advancement of language models.
    Causal disentanglement of multimodal data. (arXiv:2310.18471v2 [cs.LG] UPDATED)
    Causal representation learning algorithms discover lower-dimensional representations of data that admit a decipherable interpretation of cause and effect; as achieving such interpretable representations is challenging, many causal learning algorithms utilize elements indicating prior information, such as (linear) structural causal models, interventional data, or weak supervision. Unfortunately, in exploratory causal representation learning, such elements and prior information may not be available or warranted. Alternatively, scientific datasets often have multiple modalities or physics-based constraints, and the use of such scientific, multimodal data has been shown to improve disentanglement in fully unsupervised settings. Consequently, we introduce a causal representation learning algorithm (causalPIMA) that can use multimodal data and known physics to discover important features with causal relationships. Our innovative algorithm utilizes a new differentiable parametrization to learn a directed acyclic graph (DAG) together with a latent space of a variational autoencoder in an end-to-end differentiable framework via a single, tractable evidence lower bound loss function. We place a Gaussian mixture prior on the latent space and identify each of the mixtures with an outcome of the DAG nodes; this novel identification enables feature discovery with causal relationships. Tested against a synthetic and a scientific dataset, our results demonstrate the capability of learning an interpretable causal structure while simultaneously discovering key features in a fully unsupervised setting.
    Functional Bayesian Tucker Decomposition for Continuous-indexed Tensor Data. (arXiv:2311.04829v1 [cs.LG])
    Tucker decomposition is a powerful tensor model to handle multi-aspect data. It demonstrates the low-rank property by decomposing the grid-structured data as interactions between a core tensor and a set of object representations (factors). A fundamental assumption of such decomposition is that there are finitely many objects in each aspect or mode, corresponding to discrete indexes of data entries. However, many real-world data are not naturally posed in this setting. For example, geographic data is represented as continuous indexes of latitude and longitude coordinates, and cannot fit tensor models directly. To generalize Tucker decomposition to such scenarios, we propose Functional Bayesian Tucker Decomposition (FunBaT). We treat the continuous-indexed data as the interaction between the Tucker core and a group of latent functions. We use Gaussian processes (GP) as functional priors to model the latent functions, and then convert the GPs into a state-space prior by constructing an equivalent stochastic differential equation (SDE) to reduce computational cost. An efficient inference algorithm is further developed for scalable posterior approximation based on advanced message-passing techniques. The advantage of our method is shown in both synthetic data and several real-world applications.
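    For the standard discrete case, the interaction between the core tensor and the factors can be written out directly. The sketch below reconstructs one entry of a three-mode Tucker model; the core and factor values are made up for illustration, and the function name `tucker_entry` is hypothetical.

```python
def tucker_entry(core, factors, index):
    """X[i][j][k] = sum over a, b, c of core[a][b][c] * U1[i][a] * U2[j][b] * U3[k][c]."""
    u1, u2, u3 = factors
    i, j, k = index
    total = 0.0
    for a, plane in enumerate(core):
        for b, row in enumerate(plane):
            for c, g in enumerate(row):
                total += g * u1[i][a] * u2[j][b] * u3[k][c]
    return total


# 2 x 2 x 2 core with two non-zero interactions (illustrative values).
core = [[[1.0, 0.0], [0.0, 0.0]],
        [[0.0, 0.0], [0.0, 2.0]]]
# One factor matrix per mode; each row is the representation of one object.
factors = (
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],  # mode 1: three objects
    [[1.0, 0.0], [0.0, 1.0]],              # mode 2: two objects
    [[1.0, 0.0], [0.0, 1.0]],              # mode 3: two objects
)
print(tucker_entry(core, factors, (2, 1, 1)))  # 2.0
print(tucker_entry(core, factors, (2, 0, 0)))  # 1.0
```

    In the continuous-indexed setting described above, the row lookups such as `u1[i]` would be replaced by evaluations of latent functions at real-valued coordinates, which is what the discrete model cannot express.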
    Real-Time Recurrent Reinforcement Learning. (arXiv:2311.04830v1 [cs.LG])
    Recent advances in reinforcement learning for partially-observable Markov decision processes (POMDPs) rely on the biologically implausible backpropagation through time algorithm (BPTT) to perform gradient-descent optimisation. In this paper we propose a novel reinforcement learning algorithm that makes use of random feedback local online learning (RFLO), a biologically plausible approximation of real-time recurrent learning (RTRL), to compute the gradients of the parameters of a recurrent neural network in an online manner. By combining it with TD($\lambda$), a variant of temporal-difference reinforcement learning with eligibility traces, we create a biologically plausible, recurrent actor-critic algorithm, capable of solving discrete and continuous control tasks in POMDPs. We compare BPTT, RTRL and RFLO as well as different network architectures, and find that RFLO can perform just as well as RTRL while exceeding even BPTT in terms of complexity. The proposed method, called real-time recurrent reinforcement learning (RTRRL), serves as a model of learning in biological neural networks mimicking reward pathways in the mammalian brain.
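    The TD($\lambda$) component can be illustrated on its own in the tabular case. The sketch below is the textbook accumulating-trace update for state values, not the recurrent actor-critic of the paper; the three-state episode and the hyperparameter values are assumptions made for the example.

```python
def td_lambda_episode(values, episode, alpha=0.1, gamma=1.0, lam=0.8):
    """Update state values in place from one episode of (state, reward, next_state) steps.

    A next_state of None marks termination. Uses accumulating eligibility traces.
    """
    traces = {s: 0.0 for s in values}
    for state, reward, next_state in episode:
        next_value = 0.0 if next_state is None else values[next_state]
        td_error = reward + gamma * next_value - values[state]
        traces[state] += 1.0
        for s in values:
            # Every recently visited state shares in the current TD error.
            values[s] += alpha * td_error * traces[s]
            traces[s] *= gamma * lam
    return values


values = {"A": 0.0, "B": 0.0, "C": 0.0}
episode = [("A", 0.0, "B"), ("B", 0.0, "C"), ("C", 1.0, None)]
td_lambda_episode(values, episode)
print(values)  # A is about 0.064, B about 0.08, C about 0.1
```

    The single reward at the end of the episode updates all three states at once, with credit decaying by a factor of gamma * lam per step back in time. The update needs only the current transition and the traces, so it runs fully online.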
    FetMRQC: Automated Quality Control for fetal brain MRI. (arXiv:2304.05879v2 [eess.IV] UPDATED)
    Quality control (QC) has long been considered essential to guarantee the reliability of neuroimaging studies. It is particularly important for fetal brain MRI, where large and unpredictable fetal motion can lead to substantial artifacts in the acquired images. Existing methods for fetal brain quality assessment operate at the slice level and fail to get a comprehensive picture of the quality of an image, which can only be achieved by looking at the entire brain volume. In this work, we propose FetMRQC, a machine learning framework for automated image quality assessment tailored to fetal brain MRI, which extracts an ensemble of quality metrics that are then used to predict experts' ratings. Based on the manual ratings of more than 1000 low-resolution stacks acquired across two different institutions, we show that, compared with existing quality metrics, FetMRQC is able to generalize out-of-domain, while being interpretable and data efficient. We also release a novel manual quality rating tool designed to facilitate and optimize quality rating of fetal brain images. Our tool, along with all the code to generate, train and evaluate the model, is available at https://github.com/Medical-Image-Analysis-Laboratory/fetal_brain_qc/ .
    Constrained Adaptive Attacks: Realistic Evaluation of Adversarial Examples and Robust Training of Deep Neural Networks for Tabular Data. (arXiv:2311.04503v1 [cs.LG])
    State-of-the-art deep learning models for tabular data have recently achieved acceptable performance to be deployed in industrial settings. However, the robustness of these models remains scarcely explored. Contrary to computer vision, there is to date no realistic protocol to properly evaluate the adversarial robustness of deep tabular models due to intrinsic properties of tabular data such as categorical features, immutability, and feature relationship constraints. To fill this gap, we propose CAA, the first efficient evasion attack for constrained tabular deep learning models. CAA is an iterative parameter-free attack that combines gradient and search attacks to generate adversarial examples under constraints. We leverage CAA to build a benchmark of deep tabular models across three popular use cases: credit scoring, phishing and botnet attacks detection. Our benchmark supports ten threat models with increasing capabilities of the attacker, and reflects real-world attack scenarios for each use case. Overall, our results demonstrate how domain knowledge, adversarial training, and attack budgets impact the robustness assessment of deep tabular models and provide security practitioners with a set of recommendations to improve the robustness of deep tabular models against various evasion attack scenarios.
    When to Update Your Model: Constrained Model-based Reinforcement Learning. (arXiv:2210.08349v4 [cs.LG] UPDATED)
    Designing and analyzing model-based RL (MBRL) algorithms with guaranteed monotonic improvement has been challenging, mainly due to the interdependence between policy optimization and model learning. Existing discrepancy bounds generally ignore the impacts of model shifts, and their corresponding algorithms are prone to degrade performance by drastic model updating. In this work, we first propose a novel and general theoretical scheme for a non-decreasing performance guarantee of MBRL. Our follow-up derived bounds reveal the relationship between model shifts and performance improvement. These discoveries encourage us to formulate a constrained lower-bound optimization problem to permit the monotonicity of MBRL. A further example demonstrates that learning models from a dynamically-varying number of explorations benefits the eventual returns. Motivated by these analyses, we design a simple but effective algorithm, CMLO (Constrained Model-shift Lower-bound Optimization), by introducing an event-triggered mechanism that flexibly determines when to update the model. Experiments show that CMLO surpasses other state-of-the-art methods and produces a boost when various policy optimization methods are employed.
    Efficient and Equivariant Graph Networks for Predicting Quantum Hamiltonian. (arXiv:2306.04922v2 [cs.LG] UPDATED)
    We consider the prediction of the Hamiltonian matrix, which finds use in quantum chemistry and condensed matter physics. Efficiency and equivariance are two important, but conflicting factors. In this work, we propose an SE(3)-equivariant network, named QHNet, that achieves efficiency and equivariance. Our key advance lies in the innovative design of the QHNet architecture, which not only obeys the underlying symmetries, but also enables the reduction of the number of tensor products by 92%. In addition, QHNet prevents the exponential growth of channel dimension when more atom types are involved. We perform experiments on MD17 datasets, including four molecular systems. Experimental results show that our QHNet can achieve comparable performance to the state-of-the-art methods at a significantly faster speed. Besides, our QHNet consumes 50% less memory due to its streamlined architecture. Our code is publicly available as part of the AIRS library (https://github.com/divelab/AIRS).
    The voraus-AD Dataset for Anomaly Detection in Robot Applications. (arXiv:2311.04765v1 [cs.RO])
    During the operation of industrial robots, unusual events may endanger the safety of humans and the quality of production. When collecting data to detect such cases, it is not ensured that data from all potentially occurring errors is included as unforeseeable events may happen over time. Therefore, anomaly detection (AD) delivers a practical solution, using only normal data to learn to detect unusual events. We introduce a dataset that allows training and benchmarking of anomaly detection methods for robotic applications based on machine data which will be made publicly available to the research community. As a typical robot task the dataset includes a pick-and-place application which involves movement, actions of the end effector and interactions with the objects of the environment. Since several of the contained anomalies are not task-specific but general, evaluations on our dataset are transferable to other robotics applications as well. Additionally, we present MVT-Flow (multivariate time-series flow) as a new baseline method for anomaly detection: It relies on deep-learning-based density estimation with normalizing flows, tailored to the data domain by taking its structure into account for the architecture. Our evaluation shows that MVT-Flow outperforms baselines from previous work by a large margin of 6.2% in area under ROC.
    Leveraging Large (Visual) Language Models for Robot 3D Scene Understanding. (arXiv:2209.05629v2 [cs.RO] UPDATED)
    Abstract semantic 3D scene understanding is a problem of critical importance in robotics. As robots still lack the common-sense knowledge about household objects and locations of an average human, we investigate the use of pre-trained language models to impart common sense for scene understanding. We introduce and compare a wide range of scene classification paradigms that leverage language only (zero-shot, embedding-based, and structured-language) or vision and language (zero-shot and fine-tuned). We find that the best approaches in both categories yield $\sim 70\%$ room classification accuracy, exceeding the performance of pure-vision and graph classifiers. We also find such methods demonstrate notable generalization and transfer capabilities stemming from their use of language.
    Rank2Tell: A Multimodal Driving Dataset for Joint Importance Ranking and Reasoning. (arXiv:2309.06597v2 [cs.CV] UPDATED)
    The widespread adoption of commercial autonomous vehicles (AVs) and advanced driver assistance systems (ADAS) may largely depend on their acceptance by society, for which their perceived trustworthiness and interpretability to riders are crucial. In general, this task is challenging because modern autonomous systems software relies heavily on black-box artificial intelligence models. Towards this goal, this paper introduces a novel dataset, Rank2Tell, a multi-modal ego-centric dataset for Ranking the importance level and Telling the reason for the importance. Using various closed- and open-ended visual question answering, the dataset provides dense annotations of various semantic, spatial, temporal, and relational attributes of various important objects in complex traffic scenarios. The dense annotations and unique attributes of the dataset make it a valuable resource for researchers working on visual scene understanding and related fields. Furthermore, we introduce a joint model for joint importance level ranking and natural language captions generation to benchmark our dataset and demonstrate performance with quantitative evaluations.
    Riemannian Laplace Approximation with the Fisher Metric. (arXiv:2311.02766v2 [cs.LG] UPDATED)
    Laplace's method approximates a target density with a Gaussian distribution at its mode. It is computationally efficient and asymptotically exact for Bayesian inference due to the Bernstein-von Mises theorem, but for complex targets and finite-data posteriors it is often too crude an approximation. A recent generalization of the Laplace Approximation transforms the Gaussian approximation according to a chosen Riemannian geometry providing a richer approximation family, while still retaining computational efficiency. However, as shown here, its properties heavily depend on the chosen metric: indeed, the metric adopted in previous work results in approximations that are overly narrow as well as being biased even at the limit of infinite data. We correct this shortcoming by developing the approximation family further, deriving two alternative variants that are exact at the limit of infinite data, extending the theoretical analysis of the method, and demonstrating practical improvements in a range of experiments.
    Versatile Energy-Based Probabilistic Models for High Energy Physics. (arXiv:2302.00695v4 [cs.LG] UPDATED)
    As a classical generative modeling approach, energy-based models have the natural advantage of flexibility in the form of the energy function. Recently, energy-based models have achieved great success in modeling high-dimensional data in computer vision and natural language processing. In line with these advancements, we build a multi-purpose energy-based probabilistic model for High Energy Physics events at the Large Hadron Collider. This framework builds on a powerful generative model and describes higher-order inter-particle interactions. It suits different encoding architectures and builds on implicit generation. As for applicational aspects, it can serve as a powerful parameterized event generator for physics simulation, a generic anomalous signal detector free from spurious correlations, and an augmented event classifier for particle identification.
    Multi-Source Domain Adaptation through Dataset Dictionary Learning in Wasserstein Space. (arXiv:2307.14953v3 [cs.LG] UPDATED)
    This paper seeks to solve Multi-Source Domain Adaptation (MSDA), which aims to mitigate data distribution shifts when transferring knowledge from multiple labeled source domains to an unlabeled target domain. We propose a novel MSDA framework based on dictionary learning and optimal transport. We interpret each domain in MSDA as an empirical distribution. As such, we express each domain as a Wasserstein barycenter of dictionary atoms, which are empirical distributions. We propose a novel algorithm, DaDiL, for learning via mini-batches: (i) atom distributions; (ii) a matrix of barycentric coordinates. Based on our dictionary, we propose two novel methods for MSDA: DaDiL-R, based on the reconstruction of labeled samples in the target domain, and DaDiL-E, based on the ensembling of classifiers learned on atom distributions. We evaluate our methods on 3 benchmarks: Caltech-Office, Office 31, and CRWU, where we improved previous state-of-the-art by 3.15%, 2.29%, and 7.71% in classification performance. Finally, we show that interpolations in the Wasserstein hull of learned atoms provide data that can generalize to the target domain.
    Evaluating Emerging AI/ML Accelerators: IPU, RDU, and NVIDIA/AMD GPUs. (arXiv:2311.04417v1 [cs.AR])
    The relentless advancement of artificial intelligence (AI) and machine learning (ML) applications necessitates the development of specialized hardware accelerators capable of handling the increasing complexity and computational demands. Traditional computing architectures, based on the von Neumann model, are being outstripped by the requirements of contemporary AI/ML algorithms, leading to a surge in the creation of accelerators like the Graphcore Intelligence Processing Unit (IPU), Sambanova Reconfigurable Dataflow Unit (RDU), and enhanced GPU platforms. These hardware accelerators are characterized by their innovative data-flow architectures and other design optimizations that promise to deliver superior performance and energy efficiency for AI/ML tasks. This research provides a preliminary evaluation and comparison of these commercial AI/ML accelerators, delving into their hardware and software design features to discern their strengths and unique capabilities. By conducting a series of benchmark evaluations on common DNN operators and other AI/ML workloads, we aim to illuminate the advantages of data-flow architectures over conventional processor designs and offer insights into the performance trade-offs of each platform. The findings from our study will serve as a valuable reference for the design and performance expectations of research prototypes, thereby facilitating the development of next-generation hardware accelerators tailored for the ever-evolving landscape of AI/ML applications. Through this analysis, we aspire to contribute to the broader understanding of current accelerator technologies and to provide guidance for future innovations in the field.
    Bandit Learning to Rank with Position-Based Click Models: Personalized and Equal Treatments. (arXiv:2311.04528v1 [cs.LG])
    Online learning to rank (ONL2R) is a foundational problem for recommender systems and has received increasing attention in recent years. Among the existing approaches for ONL2R, a natural modeling architecture is the multi-armed bandit framework coupled with the position-based click model. However, developing efficient online learning policies for MAB-based ONL2R with position-based click models is highly challenging due to the combinatorial nature of the problem, and partial observability in the position-based click model. To date, results in MAB-based ONL2R with position-based click models remain rather limited, which motivates us to fill this gap in this work. Our main contributions in this work are threefold: i) We propose the first general MAB framework that captures all key ingredients of ONL2R with position-based click models. Our model considers personalized and equal treatments in ONL2R ranking recommendations, both of which are widely used in practice; ii) Based on the above analytical framework, we develop two unified greedy- and UCB-based policies called GreedyRank and UCBRank, each of which can be applied to personalized and equal ranking treatments; and iii) We show that both GreedyRank and UCBRank enjoy $O(\sqrt{t}\ln t)$ and $O(\sqrt{t\ln t})$ anytime sublinear regret for personalized and equal treatment, respectively. For the fundamentally hard equal ranking treatment, we identify classes of collective utility functions and their associated sufficient conditions under which $O(\sqrt{t}\ln t)$ and $O(\sqrt{t\ln t})$ anytime sublinear regrets are still achievable for GreedyRank and UCBRank, respectively. Our numerical experiments also verify our theoretical results and demonstrate the efficiency of GreedyRank and UCBRank in seeking the optimal action under various problem settings.
    LuminanceL1Loss: A loss function which measures perceived brightness and colour differences. (arXiv:2311.04614v1 [cs.CV])
    We introduce LuminanceL1Loss, a novel loss function designed to enhance the performance of image restoration tasks. We demonstrate its superiority over MSE when applied to the Retinexformer, BUIFD and DnCNN architectures. Our proposed LuminanceL1Loss leverages a unique approach by transforming images into grayscale and subsequently computing the MSE loss for both grayscale and color channels. Experimental results demonstrate that this innovative loss function consistently outperforms traditional methods, showcasing its potential in image denoising and other related tasks in image reconstruction. It demonstrates gains up to 4.7dB. The results presented in this study highlight the efficacy of LuminanceL1Loss for various image restoration tasks.
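    The two-term structure described in this abstract (an error on the color channels plus an error on a grayscale version of the same images) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the BT.601 luma weights, the function names, and the `gray_weight` balancing factor are all hypothetical, and mean squared error is used for both terms only because the abstract words it that way.

```python
import numpy as np

# Assumption: ITU-R BT.601 luma weights. The abstract does not say which
# grayscale conversion the paper uses.
LUMA_WEIGHTS = np.array([0.299, 0.587, 0.114])


def to_grayscale(img):
    """Convert an (H, W, 3) RGB array to an (H, W) luminance map."""
    return img @ LUMA_WEIGHTS


def luminance_color_loss(pred, target, gray_weight=1.0):
    """Color-channel error plus grayscale (luminance) error.

    Both terms are mean squared errors here, following the abstract's wording.
    gray_weight is a hypothetical factor balancing the two terms.
    """
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    color_term = np.mean((pred - target) ** 2)
    gray_term = np.mean((to_grayscale(pred) - to_grayscale(target)) ** 2)
    return color_term + gray_weight * gray_term
```

    Because the luminance term is computed from the same pixels, a uniform brightness shift is penalized twice (once per term), whereas a hue change that leaves the luminance intact is penalized only by the color term.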
    Sample Efficient Reinforcement Learning in Mixed Systems through Augmented Samples and Its Applications to Queueing Networks. (arXiv:2305.16483v2 [cs.LG] UPDATED)
    This paper considers a class of reinforcement learning problems, which involve systems with two types of states: stochastic and pseudo-stochastic. In such systems, stochastic states follow a stochastic transition kernel while the transitions of pseudo-stochastic states are deterministic given the stochastic states/transitions. We refer to such systems as mixed systems, which are widely used in various applications, including manufacturing systems, communication networks, and queueing networks. We propose a sample efficient RL method that accelerates learning by generating augmented data samples. The proposed algorithm is data-driven and learns the policy from data samples from both real and augmented samples. This method significantly improves learning by reducing the sample complexity such that the dataset only needs to have sufficient coverage of the stochastic states. We analyze the sample complexity of the proposed method under Fitted Q Iteration (FQI) and demonstrate that the optimality gap decreases as $\tilde{\mathcal{O}}(\sqrt{{1}/{n}}+\sqrt{{1}/{m}}),$ where $n$ is the number of real samples and $m$ is the number of augmented samples per real sample. It is important to note that without augmented samples, the optimality gap is $\tilde{\mathcal{O}}(1)$ due to insufficient data coverage of the pseudo-stochastic states. Our experimental results on multiple queueing network applications confirm that the proposed method indeed significantly accelerates learning in both deep Q-learning and deep policy gradient.
    Blind Federated Learning via Over-the-Air q-QAM. (arXiv:2311.04253v1 [eess.SP])
    In this work, we investigate federated edge learning over a fading multiple access channel. To alleviate the communication burden between the edge devices and the access point, we introduce a pioneering digital over-the-air computation strategy employing q-ary quadrature amplitude modulation, culminating in a low latency communication scheme. Indeed, we propose a new federated edge learning framework in which edge devices use digital modulation for over-the-air uplink transmission to the edge server while they have no access to the channel state information. Furthermore, we incorporate multiple antennas at the edge server to overcome the fading inherent in wireless communication. We analyze the number of antennas required to mitigate the fading impact effectively. We prove a non-asymptotic upper bound for the mean squared error for the proposed federated learning with digital over-the-air uplink transmissions under both noisy and fading conditions. Leveraging the derived upper bound, we characterize the convergence rate of the learning process of a non-convex loss function in terms of the mean square error of gradients due to the fading channel. Furthermore, we substantiate the theoretical assurances through numerical experiments concerning mean square error and the convergence efficacy of the digital federated edge learning framework. Notably, the results demonstrate that augmenting the number of antennas at the edge server and adopting higher-order modulations improve the model accuracy up to 60%.
    Enhancing Multi-Agent Coordination through Common Operating Picture Integration. (arXiv:2311.04740v1 [cs.MA])
    In multi-agent systems, agents possess only local observations of the environment. Communication between teammates becomes crucial for enhancing coordination. Past research has primarily focused on encoding local information into embedding messages which are unintelligible to humans. We find that using these messages in agents' policy learning leads to brittle policies when tested on out-of-distribution initial states. We present an approach to multi-agent coordination, where each agent is equipped with the capability to integrate its (history of) observations, actions and messages received into a Common Operating Picture (COP) and disseminate the COP. This process takes into account the dynamic nature of the environment and the shared mission. We conducted experiments in the StarCraft2 environment to validate our approach. Our results demonstrate the efficacy of COP integration, and show that COP-based training leads to robust policies compared to state-of-the-art Multi-Agent Reinforcement Learning (MARL) methods when faced with out-of-distribution initial states.
    Question Answering for Electronic Health Records: A Scoping Review of datasets and models. (arXiv:2310.08759v2 [cs.LG] UPDATED)
    Question Answering (QA) systems on patient-related data can assist both clinicians and patients. They can, for example, assist clinicians in decision-making and enable patients to have a better understanding of their medical history. Significant amounts of patient data are stored in Electronic Health Records (EHRs), making EHR QA an important research area. In EHR QA, the answer is obtained from the medical record of the patient. Because of the differences in data format and modality, this differs greatly from other medical QA tasks that employ medical websites or scientific papers to retrieve answers, making it critical to research EHR question answering. This study aimed to provide a methodological review of existing works on QA over EHRs. We searched for articles from January 1st, 2005 to September 30th, 2023 in four digital sources including Google Scholar, ACL Anthology, ACM Digital Library, and PubMed to collect relevant publications on EHR QA. 4111 papers were identified for our study, and after screening based on our inclusion criteria, we obtained a total of 47 papers for further study. Out of the 47 papers, 25 papers were about EHR QA datasets, and 37 papers were about EHR QA models. It was observed that QA on EHRs is relatively new and unexplored. Most of the works are fairly recent. Also, it was observed that emrQA is by far the most popular EHR QA dataset, both in terms of citations and usage in other papers. Furthermore, we identified the different models used in EHR QA along with the evaluation metrics used for these models.
    Enhancing Malware Detection by Integrating Machine Learning with Cuckoo Sandbox. (arXiv:2311.04372v1 [cs.CR])
    In the modern era, malware is experiencing a significant increase in both its variety and quantity, aligning with the widespread adoption of the digital world. This surge in malware has emerged as a critical challenge in the realm of cybersecurity, prompting numerous research endeavors and contributions to address the issue. Machine learning algorithms have been leveraged for malware detection due to their ability to uncover concealed patterns within vast datasets. However, deep learning algorithms, characterized by their multi-layered structure, surpass the limitations of traditional machine learning approaches. By employing deep learning techniques such as CNN (Convolutional Neural Network) and RNN (Recurrent Neural Network), this study aims to classify and identify malware extracted from a dataset containing API call sequences. The performance of these algorithms is compared with that of conventional machine learning methods, including SVM (Support Vector Machine), RF (Random Forest), KNN (K-Nearest Neighbors), XGB (Extreme Gradient Boosting), and GBC (Gradient Boosting Classifier), all using the same dataset. The outcomes of this research demonstrate that both deep learning and machine learning algorithms achieve remarkably high levels of accuracy, reaching up to 99% in certain cases.
    Implementation of Trained Factorization Machine Recommendation System on Quantum Annealer. (arXiv:2210.12953v2 [quant-ph] UPDATED)
    Factorization Machine (FM) is the most commonly used model to build a recommendation system since it can incorporate side information to improve performance. However, producing item suggestions for a given user with a trained FM is time-consuming. It requires a run-time of $O((N_m \log N_m)^2)$, where $N_m$ is the number of items in the dataset. To address this problem, we propose a quadratic unconstrained binary optimization (QUBO) scheme to combine with FM and apply quantum annealing (QA) computation. Compared to classical methods, this hybrid algorithm provides a faster than quadratic speedup in finding good user suggestions. We then demonstrate the aforementioned computational advantage on current NISQ hardware by experimenting with a real example on a D-Wave annealer.
    Hierarchical clustering with dot products recovers hidden tree structure. (arXiv:2305.15022v2 [stat.ML] UPDATED)
    In this paper we offer a new perspective on the well established agglomerative clustering algorithm, focusing on recovery of hierarchical structure. We recommend a simple variant of the standard algorithm, in which clusters are merged by maximum average dot product and not, for example, by minimum distance or within-cluster variance. We demonstrate that the tree output by this algorithm provides a bona fide estimate of generative hierarchical structure in data, under a generic probabilistic graphical model. The key technical innovations are to understand how hierarchical information in this model translates into tree geometry which can be recovered from data, and to characterise the benefits of simultaneously growing sample size and data dimension. We demonstrate superior tree recovery performance with real data over existing approaches such as UPGMA, Ward's method, and HDBSCAN.
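    The merge rule described in this abstract (join the two clusters whose members have the maximum average dot product, rather than the minimum distance) can be sketched as below. This is a naive illustration under assumptions, not the authors' code: the function name and return format are hypothetical, the average is taken over cross-cluster pairs only, and no attempt is made at efficiency.

```python
import numpy as np


def dot_product_agglomerative(X):
    """Greedy agglomerative clustering by maximum average dot product.

    At each step, merge the two clusters whose cross-cluster pairs of points
    have the largest mean dot product. Returns the merges in order, each as a
    (members_a, members_b, score) tuple of point indices and the merge score.
    """
    X = np.asarray(X, dtype=float)
    gram = X @ X.T  # all pairwise dot products between data points
    clusters = [[i] for i in range(len(X))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Mean dot product over all pairs (one point from each cluster).
                score = gram[np.ix_(clusters[a], clusters[b])].mean()
                if best is None or score > best[0]:
                    best = (score, a, b)
        score, a, b = best
        merges.append((clusters[a], clusters[b], float(score)))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return merges
```

    For example, `dot_product_agglomerative([[2, 0], [1, 0], [0, 1], [0, 3]])` first merges points 2 and 3, then points 0 and 1, and finally joins the two resulting clusters. The sequence of merges defines the tree.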
    Improving Fairness in Deepfake Detection. (arXiv:2306.16635v3 [cs.CV] UPDATED)
    Despite the development of effective deepfake detectors in recent years, recent studies have demonstrated that biases in the data used to train these detectors can lead to disparities in detection accuracy across different races and genders. This can result in different groups being unfairly targeted or excluded from detection, allowing undetected deepfakes to manipulate public opinion and erode trust in a deepfake detection model. While existing studies have focused on evaluating fairness of deepfake detectors, to the best of our knowledge, no method has been developed to encourage fairness in deepfake detection at the algorithm level. In this work, we make the first attempt to improve deepfake detection fairness by proposing novel loss functions that handle both the setting where demographic information (e.g., annotations of race and gender) is available as well as the case where this information is absent. Fundamentally, both approaches can be used to convert many existing deepfake detectors into ones that encourage fairness. Extensive experiments on four deepfake datasets and five deepfake detectors demonstrate the effectiveness and flexibility of our approach in improving deepfake detection fairness. Our code is available at https://github.com/littlejuyan/DF_Fairness.
    Kindness in Multi-Agent Reinforcement Learning. (arXiv:2311.04239v1 [cs.AI])
    In human societies, people often incorporate fairness in their decisions and treat reciprocally by being kind to those who act kindly. They evaluate the kindness of others' actions not only by monitoring the outcomes but also by considering the intentions. This behavioral concept can be adapted to train cooperative agents in Multi-Agent Reinforcement Learning (MARL). We propose the KindMARL method, where agents' intentions are measured by counterfactual reasoning over the environmental impact of the actions that were available to the agents. More specifically, the current environment state is compared with the estimation of the current environment state provided that the agent had chosen another action. The difference between each agent's reward, as the outcome of its action, with that of its fellow, multiplied by the intention of the fellow is then taken as the fellow's "kindness". If the result of each reward-comparison confirms the agent's superiority, it perceives the fellow's kindness and reduces its own reward. Experimental results in the Cleanup and Harvest environments show that training based on the KindMARL method enabled the agents to earn 89% (resp. 37%) and 44% (resp. 43%) more total rewards than training based on the Inequity Aversion and Social Influence methods. The effectiveness of KindMARL is further supported by experiments in a traffic light control problem.
    Physics informed machine learning with Smoothed Particle Hydrodynamics: Hierarchy of reduced Lagrangian models of turbulence. (arXiv:2110.13311v7 [physics.flu-dyn] UPDATED)
    Building efficient, accurate and generalizable reduced order models of developed turbulence remains a major challenge. This manuscript approaches this problem by developing a hierarchy of parameterized reduced Lagrangian models for turbulent flows, and investigates the effects of enforcing physical structure through Smoothed Particle Hydrodynamics (SPH) versus relying on neural networks (NNs) as universal function approximators. Starting from Neural Network (NN) parameterizations of a Lagrangian acceleration operator, this hierarchy of models gradually incorporates a weakly compressible and parameterized SPH framework, which enforces physical symmetries, such as Galilean, rotational and translational invariances. Within this hierarchy, two new parameterized smoothing kernels are developed in order to increase the flexibility of the learnable SPH simulators. For each model we experiment with different loss functions which are minimized using gradient based optimization, where efficient computations of gradients are obtained by using Automatic Differentiation (AD) and Sensitivity Analysis (SA). Each model within the hierarchy is trained on two data sets associated with weakly compressible Homogeneous Isotropic Turbulence (HIT): (1) a validation set using weakly compressible SPH; and (2) a high fidelity set from Direct Numerical Simulations (DNS). Numerical evidence shows that encoding more SPH structure improves generalizability to different turbulent Mach numbers and time shifts, and that including the novel parameterized smoothing kernels improves the accuracy of SPH at the resolved scales.
    Implicit models, latent compression, intrinsic biases, and cheap lunches in community detection. (arXiv:2210.09186v7 [cs.SI] UPDATED)
    The task of community detection, which aims to partition a network into clusters of nodes to summarize its large-scale structure, has spawned the development of many competing algorithms with varying objectives. Some community detection methods are inferential, explicitly deriving the clustering objective through a probabilistic generative model, while other methods are descriptive, dividing a network according to an objective motivated by a particular application, making it challenging to compare these methods on the same scale. Here we present a solution to this problem that associates any community detection objective, inferential or descriptive, with its corresponding implicit network generative model. This allows us to compute the description length of a network and its partition under arbitrary objectives, providing a principled measure to compare the performance of different algorithms without the need for "ground truth" labels. Our approach also gives access to instances of the community detection problem that are optimal to any given algorithm, and in this way reveals intrinsic biases in popular descriptive methods, explaining their tendency to overfit. Using our framework, we compare a number of community detection methods on artificial networks, and on a corpus of over 500 structurally diverse empirical networks. We find that more expressive community detection methods exhibit consistently superior compression performance on structured data instances, without having degraded performance on a minority of situations where more specialized algorithms perform optimally. Our results undermine the implications of the "no free lunch" theorem for community detection, both conceptually and in practice, since it is confined to unstructured data instances, unlike relevant community detection problems which are structured by requirement.
    PTW: Pivotal Tuning Watermarking for Pre-Trained Image Generators. (arXiv:2304.07361v3 [cs.LG] UPDATED)
    Deepfakes refer to content synthesized using deep generators, which, when misused, have the potential to erode trust in digital media. Synthesizing high-quality deepfakes requires access to large and complex generators only a few entities can train and provide. The threat is malicious users that exploit access to the provided model and generate harmful deepfakes without risking detection. Watermarking makes deepfakes detectable by embedding an identifiable code into the generator that is later extractable from its generated images. We propose Pivotal Tuning Watermarking (PTW), a method for watermarking pre-trained generators (i) three orders of magnitude faster than watermarking from scratch and (ii) without the need for any training data. We improve existing watermarking methods and scale to generators $4 \times$ larger than related work. PTW can embed longer codes than existing methods while better preserving the generator's image quality. We propose rigorous, game-based definitions for robustness and undetectability, and our study reveals that watermarking is not robust against an adaptive white-box attacker who controls the generator's parameters. We propose an adaptive attack that can successfully remove any watermarking with access to only 200 non-watermarked images. Our work challenges the trustworthiness of watermarking for deepfake detection when the parameters of a generator are available. The source code to reproduce our experiments is available at https://github.com/nilslukas/gan-watermark.
    FEIR: Quantifying and Reducing Envy and Inferiority for Fair Recommendation of Limited Resources. (arXiv:2311.04542v1 [cs.IR])
    In settings such as e-recruitment and online dating, recommendation involves distributing limited opportunities, calling for novel approaches to quantify and enforce fairness. We introduce \emph{inferiority}, a novel (un)fairness measure quantifying a user's competitive disadvantage for their recommended items. Inferiority complements \emph{envy}, a fairness notion measuring preference for others' recommendations. We combine inferiority and envy with \emph{utility}, an accuracy-related measure of aggregated relevancy scores. Since these measures are non-differentiable, we reformulate them using a probabilistic interpretation of recommender systems, yielding differentiable versions. We combine these loss functions in a multi-objective optimization problem called \texttt{FEIR} (Fairness through Envy and Inferiority Reduction), applied as post-processing for standard recommender systems. Experiments on synthetic and real-world data demonstrate that our approach improves trade-offs between inferiority, envy, and utility compared to naive recommendations and the baseline methods.
    DAMEX: Dataset-aware Mixture-of-Experts for visual understanding of mixture-of-datasets. (arXiv:2311.04894v1 [cs.CV])
    Construction of a universal detector poses a crucial question: How can we most effectively train a model on a large mixture of datasets? The answer lies in learning dataset-specific features and ensembling their knowledge, all within a single model. Previous methods achieve this by having separate detection heads on a common backbone, but that results in a significant increase in parameters. In this work, we present Mixture-of-Experts as a solution, highlighting that MoEs are much more than a scalability tool. We propose Dataset-Aware Mixture-of-Experts (DAMEX), where we train each expert to become an 'expert' of a dataset by learning to route each dataset's tokens to its mapped expert. Experiments on the Universal Object-Detection Benchmark show that we outperform the existing state-of-the-art by an average of +10.2 AP and improve over our non-MoE baseline by an average of +2.0 AP. We also observe consistent gains while mixing datasets with (1) limited availability, (2) disparate domains and (3) divergent label sets. Further, we qualitatively show that DAMEX is robust against expert representation collapse.
    Assessing the Generalization Gap of Learning-Based Speech Enhancement Systems in Noisy and Reverberant Environments. (arXiv:2309.06183v2 [eess.AS] UPDATED)
    The acoustic variability of noisy and reverberant speech mixtures is influenced by multiple factors, such as the spectro-temporal characteristics of the target speaker and the interfering noise, the signal-to-noise ratio (SNR) and the room characteristics. This large variability poses a major challenge for learning-based speech enhancement systems, since a mismatch between the training and testing conditions can substantially reduce the performance of the system. Generalization to unseen conditions is typically assessed by testing the system with a new speech, noise or binaural room impulse response (BRIR) database different from the one used during training. However, the difficulty of the speech enhancement task can change across databases, which can substantially influence the results. The present study introduces a generalization assessment framework that uses a reference model trained on the test condition, such that it can be used as a proxy for the difficulty of the test condition. This makes it possible to disentangle the effect of the change in task difficulty from the effect of dealing with new data, and thus to define a new measure of generalization performance termed the generalization gap. The procedure is repeated in a cross-validation fashion by cycling through multiple speech, noise, and BRIR databases to accurately estimate the generalization gap. The proposed framework is applied to evaluate the generalization potential of a feedforward neural network (FFNN), Conv-TasNet, DCCRN and MANNER. We find that for all models, the performance degrades the most in speech mismatches, while good noise and room generalization can be achieved by training on multiple databases. Moreover, while recent models show higher performance in matched conditions, their performance substantially decreases in mismatched conditions and can become inferior to that of the FFNN-based system.
    Byzantine-Tolerant Methods for Distributed Variational Inequalities. (arXiv:2311.04611v1 [cs.LG])
    Robustness to Byzantine attacks is a necessity for various distributed training scenarios. When the training reduces to the process of solving a minimization problem, Byzantine robustness is relatively well-understood. However, other problem formulations, such as min-max problems or, more generally, variational inequalities, arise in many modern machine learning and, in particular, distributed learning tasks. These problems significantly differ from the standard minimization ones and, therefore, require separate consideration. Nevertheless, only one work (Adibi et al., 2022) addresses this important question in the context of Byzantine robustness. Our work makes a further step in this direction by providing several (provably) Byzantine-robust methods for distributed variational inequality, thoroughly studying their theoretical convergence, removing the limitations of the previous work, and providing numerical comparisons supporting the theoretical findings.
    Positive Unlabeled Learning Selected Not At Random (PULSNAR): class proportion estimation when the SCAR assumption does not hold. (arXiv:2303.08269v2 [cs.LG] UPDATED)
    Positive and Unlabeled (PU) learning is a type of semi-supervised binary classification where the machine learning algorithm differentiates between a set of positive instances (labeled) and a set of both positive and negative instances (unlabeled). PU learning has broad applications in settings where confirmed negatives are unavailable or difficult to obtain, and there is value in discovering positives among the unlabeled (e.g., viable drugs among untested compounds). Most PU learning algorithms make the selected completely at random (SCAR) assumption, namely that positives are selected independently of their features. However, in many real-world applications, such as healthcare, positives are not SCAR (e.g., severe cases are more likely to be diagnosed), leading to a poor estimate of the proportion, $\alpha$, of positives among unlabeled examples and poor model calibration, resulting in an uncertain decision threshold for selecting positives. PU learning algorithms can estimate $\alpha$ or the probability of an individual unlabeled instance being positive or both. We propose two PU learning algorithms to estimate $\alpha$, calculate calibrated probabilities for PU instances, and improve classification metrics: i) PULSCAR (positive unlabeled learning selected completely at random), and ii) PULSNAR (positive unlabeled learning selected not at random). PULSNAR uses a divide-and-conquer approach that creates and solves several SCAR-like sub-problems using PULSCAR. In our experiments, PULSNAR outperformed state-of-the-art approaches on both synthetic and real-world benchmark datasets.
    GPT-ST: Generative Pre-Training of Spatio-Temporal Graph Neural Networks. (arXiv:2311.04245v1 [cs.LG])
    In recent years, there has been a rapid development of spatio-temporal prediction techniques in response to the increasing demands of traffic management and travel planning. While advanced end-to-end models have achieved notable success in improving predictive performance, their integration and expansion pose significant challenges. This work aims to address these challenges by introducing a spatio-temporal pre-training framework that seamlessly integrates with downstream baselines and enhances their performance. The framework is built upon two key designs: (i) We propose a spatio-temporal mask autoencoder as a pre-training model for learning spatio-temporal dependencies. The model incorporates customized parameter learners and hierarchical spatial pattern encoding networks. These modules are specifically designed to capture spatio-temporal customized representations and intra- and inter-cluster region semantic relationships, which have often been neglected in existing approaches. (ii) We introduce an adaptive mask strategy as part of the pre-training mechanism. This strategy guides the mask autoencoder in learning robust spatio-temporal representations and facilitates the modeling of different relationships, ranging from intra-cluster to inter-cluster, in an easy-to-hard training manner. Extensive experiments conducted on representative benchmarks demonstrate the effectiveness of our proposed method. We have made our model implementation publicly available at https://github.com/HKUDS/GPT-ST.
    Designing Robust Transformers using Robust Kernel Density Estimation. (arXiv:2210.05794v3 [cs.LG] UPDATED)
    Recent advances in Transformer architectures have empowered their empirical success in a variety of tasks across different domains. However, existing works mainly focus on predictive accuracy and computational cost, without considering other practical issues, such as robustness to contaminated samples. Recent work by Nguyen et al., (2022) has shown that the self-attention mechanism, which is the center of the Transformer architecture, can be viewed as a non-parametric estimator based on kernel density estimation (KDE). This motivates us to leverage a set of robust kernel density estimation methods for alleviating the issue of data contamination. Specifically, we introduce a series of self-attention mechanisms that can be incorporated into different Transformer architectures and discuss the special properties of each method. We then perform extensive empirical studies on language modeling and image classification tasks. Our methods demonstrate robust performance in multiple scenarios while maintaining competitive results on clean datasets.
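    The kernel-density view of self-attention mentioned above can be checked numerically. The sketch below assumes a Gaussian kernel and keys normalized to unit length, in which case Nadaraya-Watson kernel regression over (key, value) pairs coincides with softmax attention. The function names and the `bandwidth` parameter (the kernel variance) are illustrative, and the robust KDE variants proposed in the paper are not reproduced here.

```python
import numpy as np


def softmax_attention(query, keys, values, bandwidth=1.0):
    """Single-query dot-product attention with softmax weights."""
    logits = keys @ query / bandwidth
    weights = np.exp(logits - logits.max())  # subtract max for stability
    weights /= weights.sum()
    return weights @ values


def nadaraya_watson(query, keys, values, bandwidth=1.0):
    """Kernel regression with a Gaussian kernel of variance `bandwidth`."""
    sq_dist = np.sum((keys - query) ** 2, axis=1)
    weights = np.exp(-(sq_dist - sq_dist.min()) / (2.0 * bandwidth))
    weights /= weights.sum()
    return weights @ values


rng = np.random.default_rng(0)
keys = rng.normal(size=(5, 4))
# Assumption: keys share the same norm, so ||q - k||^2 differs from -2 q.k
# only by a term that is constant across keys and cancels after normalization.
keys /= np.linalg.norm(keys, axis=1, keepdims=True)
values = rng.normal(size=(5, 3))
query = rng.normal(size=4)

attention_out = softmax_attention(query, keys, values)
kernel_out = nadaraya_watson(query, keys, values)
```

    Under these assumptions the two outputs agree, which is why replacing the plain kernel estimator with a robust one yields a drop-in alternative attention mechanism.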
    Environmental-Impact Based Multi-Agent Reinforcement Learning. (arXiv:2311.04240v1 [cs.AI])
    To promote cooperation and strengthen the individual impact on the collective outcome in social dilemmas, we propose the Environmental-impact Multi-Agent Reinforcement Learning (EMuReL) method where each agent estimates the "environmental impact" of every other agent, that is, the difference in the current environment state compared to the hypothetical environment in the absence of that other agent. Inspired by the Inequity Aversion model, the agent then compares its own reward with those of its fellows multiplied by their environmental impacts. If its reward exceeds the scaled reward of one of its fellows, the agent takes "social responsibility" toward that fellow by reducing its own reward. Therefore, the less influential an agent is in reaching the current state, the more social responsibility is taken by other agents. Experiments in the Cleanup (resp. Harvest) test environment demonstrate that agents trained based on EMuReL learn to cooperate more effectively and obtain $54\%$ ($39\%$) and $20\%$ ($44\%$) more total rewards while preserving the same cooperation levels compared to when they are trained based on the two state-of-the-art reward reshaping methods inequity aversion and social influence.
    Why Do Clinical Probabilistic Models Fail To Transport Between Sites?. (arXiv:2311.04787v1 [cs.LG])
    The rising popularity of artificial intelligence in healthcare is highlighting the problem that a computational model achieving super-human clinical performance at its training sites may perform substantially worse at new sites. In this perspective, we present common sources for this failure to transport, which we divide into sources under the control of the experimenter and sources inherent to the clinical data-generating process. Of the inherent sources we look a little deeper into site-specific clinical practices that can affect the data distribution, and propose a potential solution intended to isolate the imprint of those practices on the data from the patterns of disease cause and effect that are the usual target of clinical models.
    Bridging Dimensions: Confident Reachability for High-Dimensional Controllers. (arXiv:2311.04843v1 [cs.LG])
    Autonomous systems are increasingly implemented using end-to-end trained controllers. Such controllers make decisions that are executed on the real system with images as one of the primary sensing modalities. Deep neural networks form a fundamental building block of such controllers. Unfortunately, the existing neural-network verification tools do not scale to inputs with thousands of dimensions, especially when the individual inputs (such as pixels) are devoid of clear physical meaning. This paper takes a step towards connecting exhaustive closed-loop verification with high-dimensional controllers. Our key insight is that the behavior of a high-dimensional controller can be approximated with several low-dimensional controllers in different regions of the state space. To balance approximation and verifiability, we leverage the latest verification-aware knowledge distillation. Then, if low-dimensional reachability results are inflated with statistical approximation errors, they yield a high-confidence reachability guarantee for the high-dimensional controller. We investigate two inflation techniques -- based on trajectories and actions -- both of which show convincing performance in two OpenAI gym benchmarks.
    Twitter Sentiment Analysis of Covid Vaccines. (arXiv:2311.04479v1 [cs.CL])
    In this paper, we look at a database of tweets sorted by various keywords that could indicate the users' sentiment towards covid vaccines. With social media becoming such a prevalent source of opinion, sorting and ranking tweets that hold important information such as opinions on covid vaccines is of utmost importance. Two different ranking scales were used. Ranking a tweet in this way could represent the difference between an opinion being lost and an opinion being featured on the site, which affects the decisions and behavior of people and is why researchers were interested in it. Using natural language processing techniques, our aim is to determine and categorize opinions about covid vaccines with the highest accuracy possible.
    Robust and Communication-Efficient Federated Domain Adaptation via Random Features. (arXiv:2311.04686v1 [cs.LG])
    Modern machine learning (ML) models have grown to a scale where training them on a single machine becomes impractical. As a result, there is a growing trend to leverage federated learning (FL) techniques to train large ML models in a distributed and collaborative manner. These models, however, when deployed on new devices, might struggle to generalize well due to domain shifts. In this context, federated domain adaptation (FDA) emerges as a powerful approach to address this challenge. Most existing FDA approaches typically focus on aligning the distributions between source and target domains by minimizing their (e.g., MMD) distance. Such strategies, however, inevitably introduce high communication overheads and can be highly sensitive to network reliability. In this paper, we introduce RF-TCA, an enhancement to the standard Transfer Component Analysis approach that significantly accelerates computation without compromising theoretical and empirical performance. Leveraging the computational advantage of RF-TCA, we further extend it to FDA setting with FedRF-TCA. The proposed FedRF-TCA protocol boasts communication complexity that is \emph{independent} of the sample size, while maintaining performance that is either comparable to or even surpasses state-of-the-art FDA methods. We present extensive experiments to showcase the superior performance and robustness (to network condition) of FedRF-TCA.
    Certified Data Removal from Machine Learning Models. (arXiv:1911.03030v6 [cs.LG] UPDATED)
    Good data stewardship requires removal of data at the request of the data's owner. This raises the question if and how a trained machine-learning model, which implicitly stores information about its training data, should be affected by such a removal request. Is it possible to "remove" data from a machine-learning model? We study this problem by defining certified removal: a very strong theoretical guarantee that a model from which data is removed cannot be distinguished from a model that never observed the data to begin with. We develop a certified-removal mechanism for linear classifiers and empirically study learning settings in which this mechanism is practical.
    MixTEA: Semi-supervised Entity Alignment with Mixture Teaching. (arXiv:2311.04441v1 [cs.LG])
    Semi-supervised entity alignment (EA) is a practical and challenging task because of the lack of adequate labeled mappings as training data. Most works address this problem by generating pseudo mappings for unlabeled entities. However, they either suffer from the erroneous (noisy) pseudo mappings or largely ignore the uncertainty of pseudo mappings. In this paper, we propose a novel semi-supervised EA method, termed as MixTEA, which guides the model learning with an end-to-end mixture teaching of manually labeled mappings and probabilistic pseudo mappings. We firstly train a student model using few labeled mappings as standard. More importantly, in pseudo mapping learning, we propose a bi-directional voting (BDV) strategy that fuses the alignment decisions in different directions to estimate the uncertainty via the joint matching confidence score. Meanwhile, we also design a matching diversity-based rectification (MDR) module to adjust the pseudo mapping learning, thus reducing the negative influence of noisy mappings. Extensive results on benchmark datasets as well as further analyses demonstrate the superiority and the effectiveness of our proposed method.
    Stochastic Policy Gradient Methods: Improved Sample Complexity for Fisher-non-degenerate Policies. (arXiv:2302.01734v2 [cs.LG] UPDATED)
    Recently, the impressive empirical success of policy gradient (PG) methods has catalyzed the development of their theoretical foundations. Despite the huge efforts directed at the design of efficient stochastic PG-type algorithms, the understanding of their convergence to a globally optimal policy is still limited. In this work, we develop improved global convergence guarantees for a general class of Fisher-non-degenerate parameterized policies, which allows us to address the case of continuous state-action spaces. First, we propose a Normalized Policy Gradient method with Implicit Gradient Transport (N-PG-IGT) and derive a $\tilde{\mathcal{O}}(\varepsilon^{-2.5})$ sample complexity of this method for finding a global $\varepsilon$-optimal policy. Improving over the previously known $\tilde{\mathcal{O}}(\varepsilon^{-3})$ complexity, this algorithm does not require the use of importance sampling or second-order information and samples only one trajectory per iteration. Second, we further improve this complexity to $\tilde{\mathcal{O}}(\varepsilon^{-2})$ by considering a Hessian-Aided Recursive Policy Gradient ((N)-HARPG) algorithm enhanced with a correction based on a Hessian-vector product. Interestingly, both algorithms are $(i)$ simple and easy to implement: they are single-loop, do not require large batches of trajectories and sample at most two trajectories per iteration; $(ii)$ computationally and memory efficient: they do not require expensive subroutines at each iteration and can be implemented with memory linear in the dimension of parameters.
    Fair Without Leveling Down: A New Intersectional Fairness Definition. (arXiv:2305.12495v2 [cs.LG] UPDATED)
    In this work, we consider the problem of intersectional group fairness in the classification setting, where the objective is to learn discrimination-free models in the presence of several intersecting sensitive groups. First, we illustrate various shortcomings of existing fairness measures commonly used to capture intersectional fairness. Then, we propose a new definition called the $\alpha$-Intersectional Fairness, which combines the absolute and the relative performance across sensitive groups and can be seen as a generalization of the notion of differential fairness. We highlight several desirable properties of the proposed definition and analyze its relation to other fairness measures. Finally, we benchmark multiple popular in-processing fair machine learning approaches using our new fairness definition and show that they do not achieve any improvement over a simple baseline. Our results reveal that the increase in fairness measured by previous definitions hides a "leveling down" effect, i.e., degrading the best performance over groups rather than improving the worst one.
    Can LLMs Follow Simple Rules?. (arXiv:2311.04235v1 [cs.AI])
    As Large Language Models (LLMs) are deployed with increasing real-world responsibilities, it is important to be able to specify and constrain the behavior of these systems in a reliable manner. Model developers may wish to set explicit rules for the model, such as "do not generate abusive content", but these may be circumvented by jailbreaking techniques. Evaluating how well LLMs follow developer-provided rules in the face of adversarial inputs typically requires manual review, which slows down monitoring and methods development. To address this issue, we propose Rule-following Language Evaluation Scenarios (RuLES), a programmatic framework for measuring rule-following ability in LLMs. RuLES consists of 15 simple text scenarios in which the model is instructed to obey a set of rules in natural language while interacting with the human user. Each scenario has a concise evaluation program to determine whether the model has broken any rules in a conversation. Through manual exploration of model behavior in our scenarios, we identify 6 categories of attack strategies and collect two suites of test cases: one consisting of unique conversations from manual testing and one that systematically implements strategies from the 6 categories. Across various popular proprietary and open models such as GPT-4 and Llama 2, we find that all models are susceptible to a wide variety of adversarial hand-crafted user inputs, though GPT-4 is the best-performing model. Additionally, we evaluate open models under gradient-based attacks and find significant vulnerabilities. We propose RuLES as a challenging new setting for research into exploring and defending against both manual and automatic attacks on LLMs.
    Pretraining task diversity and the emergence of non-Bayesian in-context learning for regression. (arXiv:2306.15063v2 [cs.LG] UPDATED)
    Pretrained transformers exhibit the remarkable ability of in-context learning (ICL): they can learn tasks from just a few examples provided in the prompt without updating any weights. This raises a foundational question: can ICL solve fundamentally $\textit{new}$ tasks that are very different from those seen during pretraining? To probe this question, we examine ICL's performance on linear regression while varying the diversity of tasks in the pretraining dataset. We empirically demonstrate a $\textit{task diversity threshold}$ for the emergence of ICL. Below this threshold, the pretrained transformer cannot solve unseen regression tasks, instead behaving like a Bayesian estimator with the $\textit{non-diverse pretraining task distribution}$ as the prior. Beyond this threshold, the transformer significantly outperforms this estimator; its behavior aligns with that of ridge regression, corresponding to a Gaussian prior over $\textit{all tasks}$, including those not seen during pretraining. Thus, when pretrained on data with task diversity greater than the threshold, transformers $\textit{can}$ optimally solve fundamentally new tasks in-context. Importantly, this capability hinges on it deviating from the Bayes optimal estimator with the pretraining distribution as the prior. This study also explores the effect of regularization, model capacity and task structure and underscores, in a concrete example, the critical role of task diversity, alongside data and model scale, in the emergence of ICL. Code is available at https://github.com/mansheej/icl-task-diversity.
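    The two estimators contrasted in this abstract can be written down directly for linear regression. The sketch below assumes noisy linear tasks `y = X w + noise` with known noise variance. The ridge estimator is the posterior mean under a standard Gaussian prior over all tasks, and the discrete-prior estimator is the posterior mean when the prior is uniform over a finite set of pretraining tasks. Function names and shapes are illustrative and not taken from the linked repository.

```python
import numpy as np


def ridge_estimate(X, y, noise_var):
    """Posterior mean of w under the prior w ~ N(0, I) and Gaussian noise."""
    dim = X.shape[1]
    return np.linalg.solve(X.T @ X + noise_var * np.eye(dim), X.T @ y)


def discrete_prior_estimate(X, y, tasks, noise_var):
    """Posterior mean of w under a uniform prior over the rows of `tasks`.

    tasks has shape (num_tasks, dim); the result is a likelihood-weighted
    average of the pretraining tasks, so it can never leave their convex hull.
    """
    residuals = y[None, :] - tasks @ X.T           # (num_tasks, num_examples)
    log_weights = -np.sum(residuals ** 2, axis=1) / (2.0 * noise_var)
    weights = np.exp(log_weights - log_weights.max())
    weights /= weights.sum()
    return weights @ tasks


rng = np.random.default_rng(0)
X = rng.normal(size=(16, 4))       # in-context inputs
w_true = rng.normal(size=4)        # a task not in the pretraining set
y = X @ w_true                     # noiseless labels for this illustration

w_ridge = ridge_estimate(X, y, noise_var=1e-10)
single_task = 2.0 * w_true[None, :]
w_discrete = discrete_prior_estimate(X, y, single_task, noise_var=0.25)
```

    With only one pretraining task the discrete-prior estimator returns that task regardless of the prompt, while ridge regression recovers the unseen task from the examples. This is the low-diversity versus high-diversity contrast the abstract describes.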
    Conditional Sampling of Variational Autoencoders via Iterated Approximate Ancestral Sampling. (arXiv:2308.09078v2 [cs.LG] UPDATED)
    Conditional sampling of variational autoencoders (VAEs) is needed in various applications, such as missing data imputation, but is computationally intractable. A principled choice for asymptotically exact conditional sampling is Metropolis-within-Gibbs (MWG). However, we observe that the tendency of VAEs to learn a structured latent space, a commonly desired property, can cause the MWG sampler to get "stuck" far from the target distribution. This paper mitigates the limitations of MWG: we systematically outline the pitfalls in the context of VAEs, propose two original methods that address these pitfalls, and demonstrate an improved performance of the proposed methods on a set of sampling tasks.
    Physics-Informed Graph Convolutional Networks: Towards a generalized framework for complex geometries. (arXiv:2310.14948v3 [cs.LG] UPDATED)
    Since the seminal work of [9] and their Physics-Informed neural networks (PINNs), many efforts have been made towards solving partial differential equations (PDEs) with Deep Learning models. However, some challenges remain, for instance the extension of such models to complex three-dimensional geometries, and a study of how such approaches could be combined with classical numerical solvers. In this work, we justify the use of graph neural networks for these problems, based on the similarity between these architectures and the meshes used in traditional numerical techniques for solving partial differential equations. After proving that the Physics-Informed framework has an issue with complex geometries during the computation of PDE residuals, we propose an alternative procedure that combines classical numerical solvers with the Physics-Informed framework. Finally, we propose an implementation of this approach, which we test on a three-dimensional problem on an irregular geometry.
    Unifying Structure and Language Semantic for Efficient Contrastive Knowledge Graph Completion with Structured Entity Anchors. (arXiv:2311.04250v1 [cs.AI])
    The goal of knowledge graph completion (KGC) is to predict missing links in a KG using trained facts that are already known. Recently, pre-trained language model (PLM) based methods that utilize both textual and structural information have been emerging, but their performance lags behind state-of-the-art (SOTA) structure-based methods, or they lose their inductive inference capabilities in the process of fusing structure embeddings into the text encoder. In this paper, we propose a novel method to effectively unify structure information and language semantics without losing the power of inductive reasoning. We adopt entity anchors, and these anchors and the textual descriptions of KG elements are fed together into the PLM-based encoder to learn unified representations. In addition, the proposed method utilizes additional random negative samples, which can be reused in each mini-batch during contrastive learning, to learn generalized entity representations. We verify the effectiveness of our proposed method through various experiments and analyses. The experimental results on standard benchmarks widely used in the link prediction task show that the proposed model outperforms existing SOTA KGC models. In particular, our method shows the largest performance improvement on FB15K-237, where it is competitive with the SOTA structure-based KGC methods.
    Two Complementary Perspectives to Continual Learning: Ask Not Only What to Optimize, But Also How. (arXiv:2311.04898v1 [cs.LG])
    Recent years have seen considerable progress in the continual training of deep neural networks, predominantly thanks to approaches that add replay or regularization terms to the loss function to approximate the joint loss over all tasks so far. However, we show that even with a perfect approximation to the joint loss, these approaches still suffer from temporary but substantial forgetting when starting to train on a new task. Motivated by this 'stability gap', we propose that continual learning strategies should focus not only on the optimization objective, but also on the way this objective is optimized. While there is some continual learning work that alters the optimization trajectory (e.g., using gradient projection techniques), this line of research is positioned as an alternative to improving the optimization objective, while we argue it should be complementary. To evaluate the merits of our proposition, we plan to combine replay-approximated joint objectives with gradient projection-based optimization routines to test whether the addition of the latter provides benefits in terms of (1) alleviating the stability gap, (2) increasing the learning efficiency and (3) improving the final learning outcome.
    MELEP: A Novel Predictive Measure of Transferability in Multi-Label ECG Analysis. (arXiv:2311.04224v1 [eess.SP])
    We introduce MELEP, which stands for Multi-label Expected Log of Empirical Predictions, a novel measure to estimate how effective it is to transfer knowledge from a pre-trained model to a downstream task in a multi-label setting. The measure is generic and works with new target data whose label set differs from that of the source data. It is also computationally efficient, requiring only a single forward pass of the downstream dataset through the pre-trained model. To the best of our knowledge, we are the first to develop such a transferability metric for multi-label ECG classification problems. Our experiments show that MELEP can predict the performance of pre-trained convolutional and recurrent deep neural networks on small and imbalanced ECG data. Specifically, strong correlation coefficients, with absolute values exceeding 0.6 in most cases, were observed between MELEP and the actual average F1 scores of the fine-tuned models.
    Convex Methods for Constrained Linear Bandits. (arXiv:2311.04338v1 [cs.LG])
    Recently, bandit optimization has received significant attention in real-world safety-critical systems that involve repeated interactions with humans. While there exist various algorithms with performance guarantees in the literature, practical implementation of the algorithms has not received as much attention. This work presents a comprehensive study on the computational aspects of safe bandit algorithms, specifically safe linear bandits, by introducing a framework that leverages convex programming tools to create computationally efficient policies. In particular, we first characterize the properties of the optimal policy for the safe linear bandit problem and then propose an end-to-end pipeline of safe linear bandit algorithms that only involves solving convex problems. We also numerically evaluate the performance of our proposed methods.
    Towards Few-Annotation Learning in Computer Vision: Application to Image Classification and Object Detection tasks. (arXiv:2311.04888v1 [cs.CV])
    In this thesis, we develop theoretical, algorithmic and experimental contributions for Machine Learning with limited labels, and more specifically for the tasks of Image Classification and Object Detection in Computer Vision. In a first contribution, we are interested in bridging the gap between theory and practice for popular Meta-Learning algorithms used in Few-Shot Classification. We make connections to Multi-Task Representation Learning, which benefits from solid theoretical foundations, to verify the best conditions for a more efficient meta-learning. Then, to leverage unlabeled data when training object detectors based on the Transformer architecture, we propose both an unsupervised pretraining and a semi-supervised learning method in two other separate contributions. For pretraining, we improve Contrastive Learning for object detectors by introducing the localization information. Finally, our semi-supervised method is the first tailored to transformer-based detectors.
    PB-LLM: Partially Binarized Large Language Models. (arXiv:2310.00034v2 [cs.LG] UPDATED)
    This paper explores network binarization, a radical form of quantization that compresses model weights to a single bit, specifically for the compression of Large Language Models (LLMs). Because previous binarization methods collapse LLMs, we propose a novel approach, Partially-Binarized LLM (PB-LLM), which can achieve extreme low-bit quantization while maintaining the linguistic reasoning capacity of quantized LLMs. Specifically, our exploration first uncovers the ineffectiveness of naive applications of existing binarization algorithms and highlights the imperative role of salient weights in achieving low-bit quantization. Thus, PB-LLM filters a small ratio of salient weights during binarization, allocating them to higher-bit storage, i.e., partial binarization. PB-LLM is extended to recover the capacities of quantized LLMs, analyzed from the perspectives of post-training quantization (PTQ) and quantization-aware training (QAT). Under PTQ, combining the concepts from GPTQ, we reconstruct the binarized weight matrix guided by the Hessian matrix and successfully recover the reasoning capacity of PB-LLM in low-bit. Under QAT, we freeze the salient weights during training, explore the derivation of optimal scaling factors crucial for minimizing the quantization error, and propose a scaling mechanism based on this derived scaling strategy for residual binarized weights. These explorations and the developed methodologies significantly contribute to rejuvenating the performance of low-bit quantized LLMs and present substantial advancements in the field of network binarization for LLMs. The code is available at https://github.com/hahnyuan/BinaryLLM.
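    The core idea of keeping a small fraction of salient weights at full precision while binarizing the rest can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code from the linked repository: salience is taken to be weight magnitude, the binarized weights share one scaling factor equal to their mean absolute value (the factor that minimizes squared error for sign binarization), and the name `partially_binarize` is hypothetical.

```python
import numpy as np


def partially_binarize(weights, salient_ratio=0.1):
    """Keep the largest-magnitude weights, binarize the rest to +/- scale.

    Returns the reconstructed weight matrix and the boolean salient mask.
    Magnitude-based salience and a single shared scale are assumptions.
    """
    flat_abs = np.abs(weights).ravel()
    num_salient = int(round(salient_ratio * flat_abs.size))

    flat_mask = np.zeros(flat_abs.size, dtype=bool)
    if num_salient > 0:
        # Indices of the num_salient largest magnitudes (order not needed).
        top = np.argpartition(flat_abs, -num_salient)[-num_salient:]
        flat_mask[top] = True
    mask = flat_mask.reshape(weights.shape)

    rest = weights[~mask]
    # Mean absolute value minimizes ||w - scale * sign(w)||^2 over scale.
    scale = float(np.mean(np.abs(rest))) if rest.size else 0.0
    signs = np.where(weights >= 0, 1.0, -1.0)
    return np.where(mask, weights, scale * signs), mask


W = np.array([[0.1, -0.2, 3.0],
              [-0.3, 0.4, -5.0]])
W_pb, salient = partially_binarize(W, salient_ratio=1 / 3)
```

    In this toy matrix the two outliers (3.0 and -5.0) are kept as-is, and the four remaining weights collapse to plus or minus 0.25. Binarizing the outliers with the same shared scale would give a far larger reconstruction error, which illustrates why the salient weights matter.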
    Analysis and Applications of Deep Learning with Finite Samples in Full Life-Cycle Intelligence of Nuclear Power Generation. (arXiv:2311.04247v1 [cs.LG])
    The advent of Industry 4.0 has precipitated the incorporation of Artificial Intelligence (AI) methods within industrial contexts, aiming to realize intelligent manufacturing, operation as well as maintenance, also known as industrial intelligence. However, intricate industrial milieus, particularly those relating to energy exploration and production, frequently encompass data characterized by long-tailed class distribution, sample imbalance, and domain shift. These attributes pose noteworthy challenges to data-centric Deep Learning (DL) techniques, crucial for the realization of industrial intelligence. The present study centers on the intricate and distinctive industrial scenarios of Nuclear Power Generation (NPG), meticulously scrutinizing the application of DL techniques under the constraints of finite data samples. Initially, the paper expounds on potential employment scenarios for AI across the full life-cycle of NPG. Subsequently, we delve into an evaluative exposition of DL's advancement, grounded in the finite sample perspective. This encompasses aspects such as small-sample learning, few-shot learning, zero-shot learning, and open-set recognition, also referring to the unique data characteristics of NPG. The paper then proceeds to present two specific case studies. The first revolves around the automatic recognition of zirconium alloy metallography, while the second pertains to open-set recognition for signal diagnosis of machinery sensors. These cases, spanning the entirety of NPG's life-cycle, are accompanied by constructive outcomes and insightful deliberations. By exploring and applying DL methodologies within the constraints of finite sample availability, this paper not only furnishes a robust technical foundation but also introduces a fresh perspective toward the secure and efficient advancement and exploitation of this advanced energy source.
    On Characterizing the Evolution of Embedding Space of Neural Networks using Algebraic Topology. (arXiv:2311.04592v1 [cs.LG])
    We study how the topology of feature embedding space changes as it passes through the layers of a well-trained deep neural network (DNN) through Betti numbers. Motivated by existing studies using simplicial complexes on shallow fully connected networks (FCN), we present an extended analysis using Cubical homology instead, with a variety of popular deep architectures and real image datasets. We demonstrate that as depth increases, a topologically complicated dataset is transformed into a simple one, resulting in Betti numbers attaining their lowest possible value. The rate of decay in topological complexity (as a metric) helps quantify the impact of architectural choices on the generalization ability. Interestingly, from a representation learning perspective, we highlight several invariances, such as topological invariance of (1) an architecture on similar datasets; (2) embedding space of a dataset for architectures of variable depth; (3) embedding space to input resolution/size, and (4) data sub-sampling. In order to further demonstrate the link between expressivity and the generalization capability of a network, we consider the task of ranking pre-trained models for a downstream classification task (transfer learning). Compared to existing approaches, the proposed metric has a better correlation to the actually achievable accuracy via fine-tuning the pre-trained model.
    A Deep Learning Based Resource Allocator for Communication Systems with Dynamic User Utility Demands. (arXiv:2311.04600v1 [eess.SP])
    Deep learning (DL) based resource allocation (RA) has recently gained a lot of attention due to its performance efficiency. However, most of the related studies assume an ideal case where the number of users and their utility demands, e.g., data rate constraints, are fixed and the designed DL based RA scheme exploits a policy trained only for these fixed parameters. A computationally complex policy retraining is required whenever these parameters change. Therefore, in this paper, a DL based resource allocator (ALCOR) is introduced, which allows users to freely adjust their utility demands based on, e.g., their application layer. ALCOR employs deep neural networks (DNNs), as the policy, in an iterative optimization algorithm. The optimization algorithm aims to optimize the on-off status of users in a time-sharing problem to satisfy their utility demands in expectation. The policy performs unconstrained RA (URA) -- RA without taking into account user utility demands -- among active users to maximize the sum utility (SU) at each time instant. Based on the chosen URA scheme, ALCOR can perform RA in a model-based or model-free manner and in a centralized or distributed scenario. Derived convergence analyses provide guarantees for the convergence of ALCOR, and numerical experiments corroborate its effectiveness.
    Assessing Upper Limb Motor Function in the Immediate Post-Stroke Period Using Accelerometry. (arXiv:2311.04226v1 [cs.LG])
    Accelerometry has been extensively studied as an objective means of measuring upper limb function in patients post-stroke. The objective of this paper is to determine whether the accelerometry-derived measurements frequently used in more long-term rehabilitation studies can also be used to monitor and rapidly detect sudden changes in upper limb motor function in more recently hospitalized stroke patients. Six binary classification models were created by training on variable data window times of paretic upper limb accelerometer feature data. The models were assessed on their effectiveness for differentiating new input data into two classes: severe or moderately severe motor function. The classification models yielded Area Under the Curve (AUC) scores that ranged from 0.72 to 0.82 for 15-minute data windows to 0.77 to 0.94 for 120-minute data windows. These results served as a preliminary assessment and a basis on which to further investigate the efficacy of using accelerometry and machine learning to alert healthcare professionals to rapid changes in motor function in the days immediately following a stroke.
    Watermarks in the Sand: Impossibility of Strong Watermarking for Generative Models. (arXiv:2311.04378v1 [cs.LG])
    Watermarking generative models consists of planting a statistical signal (watermark) in a model's output so that it can be later verified that the output was generated by the given model. A strong watermarking scheme satisfies the property that a computationally bounded attacker cannot erase the watermark without causing significant quality degradation. In this paper, we study the (im)possibility of strong watermarking schemes. We prove that, under well-specified and natural assumptions, strong watermarking is impossible to achieve. This holds even in the private detection algorithm setting, where the watermark insertion and detection algorithms share a secret key, unknown to the attacker. To prove this result, we introduce a generic efficient watermark attack; the attacker is not required to know the private key of the scheme or even which scheme is used. Our attack is based on two assumptions: (1) The attacker has access to a "quality oracle" that can evaluate whether a candidate output is a high-quality response to a prompt, and (2) The attacker has access to a "perturbation oracle" which can modify an output with a nontrivial probability of maintaining quality, and which induces an efficiently mixing random walk on high-quality outputs. We argue that both assumptions can be satisfied in practice by an attacker with weaker computational capabilities than the watermarked model itself, to which the attacker has only black-box access. Furthermore, our assumptions will likely only be easier to satisfy over time as models grow in capabilities and modalities. We demonstrate the feasibility of our attack by instantiating it to attack three existing watermarking schemes for large language models: Kirchenbauer et al. (2023), Kuditipudi et al. (2023), and Zhao et al. (2023). The same attack successfully removes the watermarks planted by all three schemes, with only minor quality degradation.
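The attack described above alternates between a perturbation oracle and a quality oracle. A minimal sketch of that loop follows, assuming toy stand-ins for both oracles: the word lists, the `???` garbage token, and all function names are hypothetical and not taken from the paper. It shows only the structure of the random walk, not an actual watermark removal.

```python
import random

# Hypothetical toy setup: the set of acceptable words at each position.
ACCEPTABLE = [
    {"quick", "fast", "rapid"},
    {"brown", "tan"},
    {"fox", "vixen"},
]

def is_high_quality(tokens):
    """Toy quality oracle: every token must be acceptable at its position."""
    return len(tokens) == len(ACCEPTABLE) and all(
        tok in ok for tok, ok in zip(tokens, ACCEPTABLE)
    )

def perturb(tokens, rng):
    """Toy perturbation oracle: rewrite one position, sometimes badly."""
    out = list(tokens)
    i = rng.randrange(len(out))
    if rng.random() < 0.3:
        out[i] = "???"  # a low-quality edit that the quality oracle rejects
    else:
        out[i] = rng.choice(sorted(ACCEPTABLE[i]))
    return out

def random_walk_attack(output, steps, rng):
    """Random walk over outputs, keeping only steps the quality oracle accepts."""
    current = list(output)
    for _ in range(steps):
        candidate = perturb(current, rng)
        if is_high_quality(candidate):
            current = candidate
    return current

attacked = random_walk_attack(["quick", "brown", "fox"], steps=200, rng=random.Random(0))
```

Because rejected candidates are discarded, the walk never leaves the set of high-quality outputs; the abstract's argument is that such a walk, if it mixes efficiently, ends up far from the watermarked starting point.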
    Natural Bayesian Cramér-Rao Bound with an Application to Covariance Estimation. (arXiv:2311.04748v1 [math.ST])
    In this paper, we propose to develop a new Cramér-Rao Bound (CRB) when the parameter to estimate lies in a manifold and follows a prior distribution. This derivation leads to a natural inequality between an error criterion based on geometrical properties and this new bound. This main contribution is illustrated in the problem of covariance estimation when the data follow a Gaussian distribution and the prior distribution is an inverse Wishart. Numerical simulation shows new results where the proposed CRB reveals interesting properties of the MAP estimator which are not observed with the classical Bayesian CRB.
    How to select an objective function using information theory. (arXiv:2212.06566v3 [cs.LG] UPDATED)
    In machine learning or scientific computing, model performance is measured with an objective function. But why choose one objective over another? Information theory gives one answer: To maximize the information in the model, select the objective function that represents the error in the fewest bits. To evaluate different objectives, transform them into likelihood functions. As likelihoods, their relative magnitude represents how strongly we should prefer one objective versus another, and the log of that relation represents the difference in their bit-length, as well as the difference in their uncertainty. In other words, prefer whichever objective minimizes the uncertainty. Under the information-theoretic paradigm, the ultimate objective is to maximize information (and minimize uncertainty), as opposed to any specific utility. We argue that this paradigm is well-suited to models that have many uses and no definite utility, like the large Earth system models used to understand the effects of climate change.
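One way to make the comparison above concrete is to map two common objectives to likelihoods and count bits. The sketch below assumes that mean squared error corresponds to a zero-mean Gaussian likelihood and mean absolute error to a zero-mean Laplace likelihood, each with its maximum-likelihood scale; this pairing and the function names are illustrative assumptions, not the paper's procedure.

```python
import math

def gaussian_bits(residuals):
    """Code length (bits) of residuals under a zero-mean Gaussian with MLE variance."""
    n = len(residuals)
    var = sum(r * r for r in residuals) / n
    nll_nats = 0.5 * n * (math.log(2 * math.pi * var) + 1)
    return nll_nats / math.log(2)

def laplace_bits(residuals):
    """Code length (bits) of residuals under a zero-mean Laplace with MLE scale."""
    n = len(residuals)
    b = sum(abs(r) for r in residuals) / n
    nll_nats = n * (math.log(2 * b) + 1)
    return nll_nats / math.log(2)

# Residuals with one large outlier versus residuals of uniform magnitude.
heavy_tailed = [0.1, -0.1, 0.1, -0.1, 0.1, -0.1, 0.1, -0.1, 0.1, 5.0]
even = [1.0, -1.0, 1.0, -1.0]

prefer_for_heavy = "MAE" if laplace_bits(heavy_tailed) < gaussian_bits(heavy_tailed) else "MSE"
prefer_for_even = "MAE" if laplace_bits(even) < gaussian_bits(even) else "MSE"
```

The values are differential code lengths, so only their difference between objectives is meaningful; that difference is the log of the likelihood ratio the abstract refers to.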
    Recursion in Recursion: Two-Level Nested Recursion for Length Generalization with Scalability. (arXiv:2311.04449v1 [cs.LG])
    Binary Balanced Tree RvNNs (BBT-RvNNs) enforce sequence composition according to a preset balanced binary tree structure. Thus, their non-linear recursion depth is just $\log_2 n$ ($n$ being the sequence length). Such logarithmic scaling makes BBT-RvNNs efficient and scalable on long sequence tasks such as Long Range Arena (LRA). However, such computational efficiency comes at a cost because BBT-RvNNs cannot solve simple arithmetic tasks like ListOps. On the flip side, RvNNs (e.g., Beam Tree RvNN) that do succeed on ListOps (and other structure-sensitive tasks like formal logical inference) are generally several times more expensive than even RNNs. In this paper, we introduce a novel framework -- Recursion in Recursion (RIR) to strike a balance between the two sides - getting some of the benefits from both worlds. In RIR, we use a form of two-level nested recursion - where the outer recursion is a $k$-ary balanced tree model with another recursive model (inner recursion) implementing its cell function. For the inner recursion, we choose Beam Tree RvNNs (BT-RvNN). To adjust BT-RvNNs within RIR we also propose a novel strategy of beam alignment. Overall, this entails that the total recursive depth in RIR is upper-bounded by $k \log_k n$. Our best RIR-based model is the first model that demonstrates high ($\geq 90\%$) length-generalization performance on ListOps while at the same time being scalable enough to be trainable on long sequence inputs from LRA. Moreover, in terms of accuracy in the LRA language tasks, it performs competitively with Structured State Space Models (SSMs) without any special initialization - outperforming Transformers by a large margin. On the other hand, while SSMs can marginally outperform RIR on LRA, they (SSMs) fail to length-generalize on ListOps. Our code is available at: https://github.com/JRC1995/BeamRecursionFamily/.
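The two depth figures quoted above can be compared numerically. This is a small arithmetic sketch; the function names are hypothetical, and the choice of $n = 1024$ and the values of $k$ are illustrative only.

```python
import math

def bbt_depth(n):
    """Recursion depth of a balanced binary tree over n items: ceil(log2 n)."""
    return math.ceil(math.log2(n))

def rir_depth_bound(n, k):
    """Upper bound k * log_k(n) on total recursive depth quoted for RIR."""
    return k * math.log(n) / math.log(k)

n = 1024
depths = {
    "balanced binary tree": bbt_depth(n),    # log2(1024)
    "RIR bound, k=4": rir_depth_bound(n, 4),    # 4 * log4(1024)
    "RIR bound, k=32": rir_depth_bound(n, 32),  # 32 * log32(1024)
    "sequential RNN": n,                     # one step per token
}
```

For fixed $n$, the bound sits between the logarithmic depth of a balanced tree and the linear depth of a sequential model, which matches the trade-off the abstract describes.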
    Beyond Size: How Gradients Shape Pruning Decisions in Large Language Models. (arXiv:2311.04902v1 [cs.CL])
    Large Language Models (LLMs) with a billion or more parameters are prime targets for network pruning, which aims to reduce a portion of the network weights without compromising performance. Prior approaches such as Weights Magnitude, SparseGPT, and Wanda, either concentrated solely on weights or integrated weights with activations for sparsity. However, they overlooked the informative gradients derived from pretrained large language models. In this paper, we present a novel sparsity-centric pruning method for pretrained LLMs, termed Gradient-based Language Model Pruner (GBLM-Pruner). GBLM-Pruner leverages the first-order term of the Taylor expansion, operating in a training-free manner by harnessing properly normalized gradients from a few calibration samples to determine the importance pruning score, and substantially outperforms competitive counterparts like SparseGPT and Wanda in multiple benchmarks. Intriguingly, after incorporating gradients, the unstructured pruning method tends to reveal some structural patterns post-pruning, which mirrors the geometric interdependence inherent in the LLMs' parameter structure. Additionally, GBLM-Pruner functions without any subsequent retraining or weight updates to maintain its simplicity as other counterparts. Extensive evaluations on LLaMA-1 and LLaMA-2 across various language benchmarks and perplexity show that GBLM-Pruner surpasses magnitude pruning, Wanda (weights+activations) and SparseGPT (weights+activations+weight update) by significant margins. Our code and models are available at https://github.com/RocktimJyotiDas/GBLM-Pruner.
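A minimal sketch of first-order, gradient-aware pruning follows. It assumes the importance score is simply $|w| \cdot |g|$, the magnitude of the first-order Taylor term; the paper's actual gradient normalization and any activation term are not reproduced here, and the function name is hypothetical.

```python
def prune_by_gradient_score(weights, grads, sparsity):
    """Zero out the fraction `sparsity` of weights with the lowest |w| * |g| score."""
    scores = [abs(w) * abs(g) for w, g in zip(weights, grads)]
    n_prune = int(len(weights) * sparsity)
    # Indices sorted from least to most important.
    order = sorted(range(len(weights)), key=lambda i: scores[i])
    pruned = set(order[:n_prune])
    return [0.0 if i in pruned else w for i, w in enumerate(weights)]

weights = [0.5, -2.0, 0.1, 1.0]
grads = [0.1, 0.01, 3.0, 0.5]   # gradients from a few calibration samples (assumed given)
sparse = prune_by_gradient_score(weights, grads, sparsity=0.5)
```

In this toy case the large weight `-2.0` is removed because its gradient is tiny, while the small weight `0.1` survives; magnitude-only pruning would make the opposite choice.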
    Fast, accurate, and interpretable decoding of electrocorticographic signals using dynamic mode decomposition. (arXiv:2311.04225v1 [eess.SP])
    Dynamic mode (DM) decomposition decomposes spatiotemporal signals into basic oscillatory components (DMs). DMs can improve the accuracy of neural decoding when used with the nonlinear Grassmann kernel, compared to conventional power features. However, such kernel-based machine learning algorithms have three limitations: large computational time preventing real-time application, incompatibility with non-kernel algorithms, and low interpretability. Here, we propose a mapping function corresponding to the Grassmann kernel that explicitly transforms DMs into spatial DM (sDM) features, which can be used in any machine learning algorithm. Using electrocorticographic signals recorded during various movement and visual perception tasks, the sDM features were shown to improve the decoding accuracy and computational time compared to conventional methods. Furthermore, the components of the sDM features informative for decoding showed similar characteristics to the high-$\gamma$ power of the signals, but with higher trial-to-trial reproducibility. The proposed sDM features enable fast, accurate, and interpretable neural decoding.
    Likelihood Ratio Confidence Sets for Sequential Decision Making. (arXiv:2311.04402v1 [cs.LG])
    Certifiable, adaptive uncertainty estimates for unknown quantities are an essential ingredient of sequential decision-making algorithms. Standard approaches rely on problem-dependent concentration results and are limited to a specific combination of parameterization, noise family, and estimator. In this paper, we revisit the likelihood-based inference principle and propose to use likelihood ratios to construct any-time valid confidence sequences without requiring specialized treatment in each application scenario. Our method is especially suitable for problems with well-specified likelihoods, and the resulting sets always maintain the prescribed coverage in a model-agnostic manner. The size of the sets depends on a choice of estimator sequence in the likelihood ratio. We discuss how to provably choose the best sequence of estimators and shed light on connections to online convex optimization with algorithms such as Follow-the-Regularized-Leader. To counteract the initially large bias of the estimators, we propose a reweighting scheme that also opens up deployment in non-parametric settings such as RKHS function classes. We provide a non-asymptotic analysis of the likelihood ratio confidence sets size for generalized linear models, using insights from convex duality and online learning. We showcase the practical strength of our method on generalized linear bandit problems, survival analysis, and bandits with various additive noise distributions.
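The construction above can be sketched for a Bernoulli parameter. The code assumes a smoothed running mean as the estimator sequence and a finite grid of candidate parameters; both choices, and the function name, are illustrative assumptions rather than the paper's method. A candidate $\theta$ stays in the set while the likelihood ratio of the plug-in predictions to $\theta$ is below $1/\alpha$.

```python
import math

def lr_confidence_set(xs, alpha=0.05, grid_size=99):
    """Likelihood-ratio confidence set for a Bernoulli mean after observing xs."""
    grid = [(i + 1) / (grid_size + 1) for i in range(grid_size)]
    log_ratio = {theta: 0.0 for theta in grid}
    ones = 0
    for t, x in enumerate(xs):
        # Estimate built only from the t observations seen before x.
        p_hat = (ones + 1) / (t + 2)
        num = math.log(p_hat if x == 1 else 1 - p_hat)
        for theta in grid:
            den = math.log(theta if x == 1 else 1 - theta)
            log_ratio[theta] += num - den
        ones += x
    threshold = math.log(1 / alpha)
    return [theta for theta in grid if log_ratio[theta] < threshold]

# Deterministic toy data with 30% ones.
data = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0] * 50
cs = lr_confidence_set(data)
```

Because each estimate uses only past data, the ratio evaluated at the true parameter is a nonnegative martingale with mean one, and Ville's inequality gives coverage at every time step simultaneously. A better estimator sequence makes the ratio grow faster for wrong candidates and so shrinks the set.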
    What Can We Learn from Unlearnable Datasets?. (arXiv:2305.19254v3 [cs.LG] UPDATED)
    In an era of widespread web scraping, unlearnable dataset methods have the potential to protect data privacy by preventing deep neural networks from generalizing. But in addition to a number of practical limitations that make their use unlikely, we make a number of findings that call into question their ability to safeguard data. First, it is widely believed that neural networks trained on unlearnable datasets only learn shortcuts, simpler rules that are not useful for generalization. In contrast, we find that networks actually can learn useful features that can be reweighed for high test performance, suggesting that image protection is not assured. Unlearnable datasets are also believed to induce learning shortcuts through linear separability of added perturbations. We provide a counterexample, demonstrating that linear separability of perturbations is not a necessary condition. To emphasize why linearly separable perturbations should not be relied upon, we propose an orthogonal projection attack which allows learning from unlearnable datasets published in ICML 2021 and ICLR 2023. Our proposed attack is significantly less complex than recently proposed techniques.
    Exploring Best Practices for ECG Signal Processing in Machine Learning. (arXiv:2311.04229v1 [eess.SP])
    In this work we search for best practices in pre-processing of Electrocardiogram (ECG) signals in order to train better classifiers for the diagnosis of heart conditions. State of the art machine learning algorithms have achieved remarkable results in classification of some heart conditions using ECG data, yet there appears to be no consensus on pre-processing best practices. Is this lack of consensus due to different conditions and architectures requiring different processing steps for optimal performance? Is it possible that state of the art deep-learning models have rendered pre-processing unnecessary? In this work we apply down-sampling, normalization, and filtering functions to 3 different multi-label ECG datasets and measure their effects on 3 different high-performing time-series classifiers. We find that sampling rates as low as 50Hz can yield comparable results to the commonly used 500Hz. This is significant as smaller sampling rates will result in smaller datasets and models, which require less time and resources to train. Additionally, despite their common usage, we found min-max normalization to be slightly detrimental overall, and band-passing to make no measurable difference. We found the blind approach to pre-processing of ECGs for multi-label classification to be ineffective, with the exception of sample rate reduction which reliably reduces computational resources, but does not increase accuracy.
    The PetShop Dataset -- Finding Causes of Performance Issues across Microservices. (arXiv:2311.04806v1 [cs.DC])
    Identifying root causes for unexpected or undesirable behavior in complex systems is a prevalent challenge. This issue becomes especially crucial in modern cloud applications that employ numerous microservices. Although the machine learning and systems research communities have proposed various techniques to tackle this problem, there is currently a lack of standardized datasets for quantitative benchmarking. Consequently, research groups are compelled to create their own datasets for experimentation. This paper introduces a dataset specifically designed for evaluating root cause analyses in microservice-based applications. The dataset encompasses latency, requests, and availability metrics emitted in 5-minute intervals from a distributed application. In addition to normal operation metrics, the dataset includes 68 injected performance issues, which increase latency and reduce availability throughout the system. We showcase how this dataset can be used to evaluate the accuracy of a variety of methods spanning different causal and non-causal characterisations of the root cause analysis problem. We hope the new dataset, available at https://github.com/amazon-science/petshop-root-cause-analysis/ enables further development of techniques in this important area.
    A Lightweight Architecture for Real-Time Neuronal-Spike Classification. (arXiv:2311.04808v1 [cs.AR])
    Electrophysiological recordings of neural activity in a mouse's brain are very popular among neuroscientists for understanding brain function. One particular area of interest is acquiring recordings from the Purkinje cells in the cerebellum in order to understand brain injuries and the loss of motor functions. However, current setups for such experiments do not allow the mouse to move freely and, thus, do not capture its natural behaviour since they have a wired connection between the animal's head stage and an acquisition device. In this work, we propose a lightweight neuronal-spike detection and classification architecture that leverages the unique characteristics of the Purkinje cells to discard unneeded information from the sparse neural data in real time. This allows the (condensed) data to be easily stored on a removable storage device on the head stage, alleviating the need for wires. Our proposed implementation shows a >95% overall classification accuracy while still resulting in a small-form-factor design, which allows for the free movement of mice during experiments. Moreover, the power-efficient nature of the design and the usage of STT-RAM (Spin Transfer Torque Magnetic Random Access Memory) as the removable storage allows the head stage to easily operate on a tiny battery for up to approximately 4 days.
    Speech language models lack important brain-relevant semantics. (arXiv:2311.04664v1 [cs.CL])
    Despite known differences between reading and listening in the brain, recent work has shown that text-based language models predict both text-evoked and speech-evoked brain activity to an impressive degree. This poses the question of what types of information language models truly predict in the brain. We investigate this question via a direct approach, in which we eliminate information related to specific low-level stimulus features (textual, speech, and visual) in the language model representations, and observe how this intervention affects the alignment with fMRI brain recordings acquired while participants read versus listened to the same naturalistic stories. We further contrast our findings with speech-based language models, which would be expected to predict speech-evoked brain activity better, provided they model language processing in the brain well. Using our direct approach, we find that both text-based and speech-based language models align well with early sensory regions due to shared low-level features. Text-based models continue to align well with later language regions even after removing these features, while, surprisingly, speech-based models lose most of their alignment. These findings suggest that speech-based models can be further improved to better reflect brain-like language processing.
    AI-Enabled Unmanned Vehicle-Assisted Reconfigurable Intelligent Surfaces: Deployment, Prototyping, Experiments, and Opportunities. (arXiv:2311.04241v1 [eess.SP])
    Wireless data demands are increasingly high as the sixth-generation (6G) technology evolves. Reconfigurable intelligent surface (RIS) is deemed a promising 6G technique for extending service coverage, reducing power consumption, and enhancing spectral efficiency. In this article, we provide some fundamentals of RIS deployment from theory and hardware perspectives, as well as the utilization of artificial intelligence (AI) and machine learning. We built a prototype of intelligent deployment of RIS (i-Dris), including dual-band auto-guided vehicle (AGV) assisted RISs associated with an mmWave base station (BS) and a receiver. The RISs are deployed on the AGV with configured incident/reflection angles, while both the mmWave BS and receiver are associated with an edge server monitoring downlink packets for obtaining system throughput. We have designed a federated multi-agent reinforcement learning scheme associated with several AGV-RIS agents and sub-agents per AGV-RIS consisting of the deployment of position, height, orientation and elevation angles. The experimental results present stationary measurements in different aspects and scenarios. The i-Dris can reach up to 980 Mbps transmission throughput under a bandwidth of 100 MHz with comparably low complexity as well as rapid deployment, which outperforms the other existing works. Finally, we highlight some opportunities and future issues in leveraging RIS-empowered wireless communication networks.
    Determination of toxic comments and unintended model bias minimization using Deep learning approach. (arXiv:2311.04789v1 [cs.LG])
    Online conversations can be toxic and subjected to threats, abuse, or harassment. To identify toxic text comments, several deep learning and machine learning models have been proposed throughout the years. However, recent studies demonstrate that because of the imbalances in the training data, some models are more likely to show unintended biases including gender bias and identity bias. In this research, our aim is to detect toxic comments and reduce the unintended bias concerning identity features such as race, gender, sex, and religion by fine-tuning an attention-based model called BERT (Bidirectional Encoder Representation from Transformers). We apply weighted loss to address the issue of unbalanced data and compare the performance of a fine-tuned BERT model with a traditional Logistic Regression model in terms of classification and bias minimization. The Logistic Regression model with the TFIDF vectorizer achieves 57.1% accuracy, and the fine-tuned BERT model's accuracy is 89%. Code is available at https://github.com/zim10/Determine_Toxic_comment_and_identity_bias.git
    Causal Scoring: A Framework for Effect Estimation, Effect Ordering, and Effect Classification. (arXiv:2206.12532v3 [stat.ML] UPDATED)
    This paper introduces causal scoring as a novel approach to frame causal estimation in the context of decision making. Causal scoring entails the estimation of scores that support decision making by providing insights into causal effects. We present three valuable causal interpretations of these scores: effect estimation (EE), effect ordering (EO), and effect classification (EC). In the EE interpretation, the causal score represents the effect itself. The EO interpretation implies that the score can serve as a proxy for the magnitude of the effect, enabling the sorting of individuals based on their causal effects. The EC interpretation enables the classification of individuals into high- and low-effect categories using a predefined threshold. We demonstrate the value of these alternative causal interpretations (EO and EC) through two key results. First, we show that aligning the statistical modeling with the desired causal interpretation improves the accuracy of causal estimation. Second, we establish that more flexible causal interpretations are plausible in a wider range of data-generating processes and propose conditions to assess their validity. We showcase the practical utility of the causal scoring framework through examples in diverse fields such as advertising, healthcare, and education, illustrating how it facilitates reasoning about flexible causal interpretations of statistical estimates in various contexts. The examples encompass confounded estimates, effect estimates on surrogate outcomes, and even predictions about non-causal quantities as potential causal scores.
    Sharp Spectral Rates for Koopman Operator Learning. (arXiv:2302.02004v4 [cs.LG] UPDATED)
    Nonlinear dynamical systems can be handily described by the associated Koopman operator, whose action evolves every observable of the system forward in time. Learning the Koopman operator and its spectral decomposition from data is enabled by a number of algorithms. In this work we present for the first time non-asymptotic learning bounds for the Koopman eigenvalues and eigenfunctions. We focus on time-reversal-invariant stochastic dynamical systems, including the important example of Langevin dynamics. We analyze two popular estimators: Extended Dynamic Mode Decomposition (EDMD) and Reduced Rank Regression (RRR). Our results critically hinge on novel minimax estimation bounds for the operator norm error, that may be of independent interest. Our spectral learning bounds are driven by the simultaneous control of the operator norm error and a novel metric distortion functional of the estimated eigenfunctions. The bounds indicate that both EDMD and RRR have similar variance, but EDMD suffers from a larger bias which might be detrimental to its learning rate. Our results shed new light on the emergence of spurious eigenvalues, an issue which is well known empirically. Numerical experiments illustrate the implications of the bounds in practice.
    RankAug: Augmented data ranking for text classification. (arXiv:2311.04535v1 [cs.CL])
    Research on data generation and augmentation has been focused mainly on enhancing generation models, leaving a notable gap in the exploration and refinement of methods for evaluating synthetic data. There are several text similarity metrics within the context of generated data filtering which can impact the performance of specific Natural Language Understanding (NLU) tasks, specifically focusing on intent and sentiment classification. In this study, we propose RankAug, a text-ranking approach that detects and filters out the top augmented texts in terms of being most similar in meaning with lexical and syntactical diversity. Through experiments conducted on multiple datasets, we demonstrate that the judicious selection of filtering techniques can yield a substantial improvement of up to 35% in classification accuracy for under-represented classes.
    Device Sampling and Resource Optimization for Federated Learning in Cooperative Edge Networks. (arXiv:2311.04350v1 [cs.NI])
    The conventional federated learning (FedL) architecture distributes machine learning (ML) across worker devices by having them train local models that are periodically aggregated by a server. FedL ignores two important characteristics of contemporary wireless networks, however: (i) the network may contain heterogeneous communication/computation resources, and (ii) there may be significant overlaps in devices' local data distributions. In this work, we develop a novel optimization methodology that jointly accounts for these factors via intelligent device sampling complemented by device-to-device (D2D) offloading. Our optimization methodology aims to select the best combination of sampled nodes and data offloading configuration to maximize FedL training accuracy while minimizing data processing and D2D communication resource consumption subject to realistic constraints on the network topology and device capabilities. Theoretical analysis of the D2D offloading subproblem leads to new FedL convergence bounds and an efficient sequential convex optimizer. Using these results, we develop a sampling methodology based on graph convolutional networks (GCNs) which learns the relationship between network attributes, sampled nodes, and D2D data offloading to maximize FedL accuracy. Through evaluation on popular datasets and real-world network measurements from our edge testbed, we find that our methodology outperforms popular device sampling methodologies from literature in terms of ML model performance, data processing overhead, and energy consumption.
    Lidar Annotation Is All You Need. (arXiv:2311.04777v1 [cs.CV])
    In recent years, computer vision has transformed fields such as medical imaging, object recognition, and geospatial analytics. One of the fundamental tasks in computer vision is semantic image segmentation, which is vital for precise object delineation. Autonomous driving represents one of the key areas where computer vision algorithms are applied. The task of road surface segmentation is crucial in self-driving systems, but it requires a labor-intensive annotation process in several data domains. The work described in this paper aims to improve the efficiency of image segmentation using a convolutional neural network in a multi-sensor setup. This approach leverages lidar (Light Detection and Ranging) annotations to directly train image segmentation models on RGB images. Lidar supplements the images by emitting laser pulses and measuring reflections to provide depth information. However, lidar's sparse point clouds often create difficulties for accurate object segmentation. Segmentation of point clouds requires time-consuming preliminary data preparation and a large amount of computational resources. The key innovation of our approach is the masked loss, addressing sparse ground-truth masks from point clouds. By calculating loss exclusively where lidar points exist, the model learns road segmentation on images by using lidar points as ground truth. This approach allows for blending of different ground-truth data types during model training. Experimental validation of the approach on benchmark datasets shows comparable performance to a high-quality image segmentation model. Incorporating lidar reduces the load on annotations and enables training of image-segmentation models without loss of segmentation quality. The methodology is tested on diverse datasets, both publicly available and proprietary. The strengths and weaknesses of the proposed method are also discussed in the paper.
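The masked loss described above evaluates the loss only at pixels where a projected lidar point provides a label. A minimal sketch with binary cross-entropy over flattened pixel lists follows; the choice of binary cross-entropy, the flattening, and the function name are assumptions made for illustration.

```python
import math

def masked_bce(pred, target, mask, eps=1e-7):
    """Mean binary cross-entropy over pixels where mask == 1; other pixels are ignored."""
    total, count = 0.0, 0
    for p, t, m in zip(pred, target, mask):
        if m:
            p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
            total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
            count += 1
    return total / count if count else 0.0

pred = [0.9, 0.1, 0.5, 0.5]    # predicted road probability per pixel
target = [1, 0, 0, 1]          # road / not road labels
mask = [1, 1, 0, 0]            # 1 where a lidar point exists
loss = masked_bce(pred, target, mask)
```

Pixels outside the mask contribute neither to the loss nor to its gradient, so sparse lidar labels can supervise a dense image model. A fully labeled image is the special case where the mask is all ones, which is what allows mixing the two kinds of ground truth.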
    Army of Thieves: Enhancing Black-Box Model Extraction via Ensemble based sample selection. (arXiv:2311.04588v1 [cs.LG])
    Machine Learning (ML) models become vulnerable to Model Stealing Attacks (MSA) when they are deployed as a service. In such attacks, the deployed model is queried repeatedly to build a labelled dataset. This dataset allows the attacker to train a thief model that mimics the original model. To maximize query efficiency, the attacker has to select the most informative subset of data points from the pool of available data. Existing attack strategies utilize approaches like Active Learning and Semi-Supervised learning to minimize costs. However, in the black-box setting, these approaches may select sub-optimal samples as they train only one thief model. Depending on the thief model's capacity and the data it was pretrained on, the model might even select noisy samples that harm the learning process. In this work, we explore the usage of an ensemble of deep learning models as our thief model. We call our attack Army of Thieves (AOT) as we train multiple models with varying complexities to leverage the crowd's wisdom. Based on the ensemble's collective decision, uncertain samples are selected for querying, while the most confident samples are directly included in the training data. Our approach is the first one to utilize an ensemble of thief models to perform model extraction. We outperform the base approaches of existing state-of-the-art methods by at least 3% and achieve a 21% higher adversarial sample transferability than previous work for models trained on the CIFAR-10 dataset.
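The selection rule described above, querying uncertain samples and pseudo-labeling confident ones, can be sketched as follows. The sketch assumes confidence is the largest class probability after averaging the ensemble's predictions, with two fixed thresholds; the confidence measure, the threshold values, and the function name are all hypothetical.

```python
def split_by_ensemble_confidence(ensemble_probs, low=0.6, high=0.9):
    """Split pool samples into those to query and those to pseudo-label.

    ensemble_probs[m][i] is model m's class-probability list for sample i.
    Returns (indices to query, {index: pseudo-label}).
    """
    n_models = len(ensemble_probs)
    n_samples = len(ensemble_probs[0])
    to_query, pseudo_labeled = [], {}
    for i in range(n_samples):
        n_classes = len(ensemble_probs[0][i])
        avg = [
            sum(ensemble_probs[m][i][c] for m in range(n_models)) / n_models
            for c in range(n_classes)
        ]
        conf = max(avg)
        if conf < low:
            to_query.append(i)                  # ensemble is unsure: spend a query
        elif conf >= high:
            pseudo_labeled[i] = avg.index(conf)  # ensemble is sure: label it ourselves
    return to_query, pseudo_labeled

model_a = [[0.95, 0.05], [0.6, 0.4], [0.8, 0.2]]
model_b = [[0.97, 0.03], [0.4, 0.6], [0.7, 0.3]]
to_query, pseudo = split_by_ensemble_confidence([model_a, model_b])
```

Sample 1 is sent to the victim model because the two thief models disagree, sample 0 is pseudo-labeled without a query, and sample 2 falls between the thresholds and is left in the pool.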
    Information-Theoretic Generalization Bounds for Transductive Learning and its Applications. (arXiv:2311.04561v1 [cs.LG])
    In this paper, we develop data-dependent and algorithm-dependent generalization bounds for transductive learning algorithms in the context of information theory for the first time. We show that the generalization gap of transductive learning algorithms can be bounded by the mutual information between training labels and hypothesis. By innovatively proposing the concept of transductive supersamples, we go beyond the inductive learning setting and establish upper bounds in terms of various information measures. Furthermore, we derive novel PAC-Bayesian bounds and build the connection between generalization and loss landscape flatness under the transductive learning setting. Finally, we present the upper bounds for adaptive optimization algorithms and demonstrate the applications of results on semi-supervised learning and graph learning scenarios. Our theoretic results are validated on both synthetic and real-world datasets.
    Predicting Properties of Nodes via Community-Aware Features. (arXiv:2311.04730v1 [cs.SI])
    A community structure that is often present in complex networks plays an important role not only in their formation but also in shaping the dynamics of these networks, affecting properties of their nodes. In this paper, we propose a family of community-aware node features and then investigate their properties. We show that they have high predictive power for classification tasks. We also verify that they contain information that can be recovered neither by classical node features nor by node embeddings (both classical as well as structural).
    IoT-Based Environmental Control System for Fish Farms with Sensor Integration and Machine Learning Decision Support. (arXiv:2311.04258v1 [eess.SP])
    In response to the burgeoning global demand for seafood and the challenges of managing fish farms, we introduce an innovative IoT-based environmental control system that integrates sensor technology and advanced machine learning decision support. Deploying a network of wireless sensors within the fish farm, we continuously collect real-time data on crucial environmental parameters, including water temperature, pH levels, humidity, and fish behavior. This data undergoes meticulous preprocessing to ensure its reliability, including imputation, outlier detection, feature engineering, and synchronization. At the heart of our system are four distinct machine learning algorithms: Random Forests predict and optimize water temperature and pH levels for the fish, fostering their health and growth; Support Vector Machines (SVMs) function as an early warning system, promptly detecting diseases and parasites in fish; Gradient Boosting Machines (GBMs) dynamically fine-tune the feeding schedule based on real-time environmental conditions, promoting resource efficiency and fish productivity; Neural Networks manage the operation of critical equipment like water pumps and heaters to maintain the desired environmental conditions within the farm. These machine learning algorithms collaboratively make real-time decisions to ensure that the fish farm's environmental conditions align with predefined specifications, leading to improved fish health and productivity while simultaneously reducing resource wastage, thereby contributing to increased profitability and sustainability. This research article showcases the power of data-driven decision support in fish farming, promising to meet the growing demand for seafood while emphasizing environmental responsibility and economic viability, thus revolutionizing the future of fish farming.
    LRM: Large Reconstruction Model for Single Image to 3D. (arXiv:2311.04400v1 [cs.CV])
    We propose the first Large Reconstruction Model (LRM) that predicts the 3D model of an object from a single input image within just 5 seconds. In contrast to many previous methods that are trained on small-scale datasets such as ShapeNet in a category-specific fashion, LRM adopts a highly scalable transformer-based architecture with 500 million learnable parameters to directly predict a neural radiance field (NeRF) from the input image. We train our model in an end-to-end manner on massive multi-view data containing around 1 million objects, including both synthetic renderings from Objaverse and real captures from MVImgNet. This combination of a high-capacity model and large-scale training data empowers our model to be highly generalizable and produce high-quality 3D reconstructions from various testing inputs including real-world in-the-wild captures and images from generative models. Video demos and interactable 3D meshes can be found on this website: https://yiconghong.me/LRM/.
    Strategies for Parallelizing the Big-Means Algorithm: A Comprehensive Tutorial for Effective Big Data Clustering. (arXiv:2311.04517v1 [cs.DC])
    This study focuses on the optimization of the Big-means algorithm for clustering large-scale datasets, exploring four distinct parallelization strategies. We conducted extensive experiments to assess the computational efficiency, scalability, and clustering performance of each approach, revealing their benefits and limitations. The paper also delves into the trade-offs between computational efficiency and clustering quality, examining the impacts of various factors. Our insights provide practical guidance on selecting the best parallelization strategy based on available resources and dataset characteristics, contributing to a deeper understanding of parallelization techniques for the Big-means algorithm.
    Oobleck: Resilient Distributed Training of Large Models Using Pipeline Templates. (arXiv:2309.08125v2 [cs.DC] UPDATED)
    Oobleck enables resilient distributed training of large DNN models with guaranteed fault tolerance. It takes a planning-execution co-design approach, where it first generates a set of heterogeneous pipeline templates and instantiates at least $f+1$ logically equivalent pipeline replicas to tolerate any $f$ simultaneous failures. During execution, it relies on already-replicated model states across the replicas to provide fast recovery. Oobleck provably guarantees that some combination of the initially created pipeline templates can be used to cover all available resources after $f$ or fewer simultaneous failures, thereby avoiding resource idling at all times. Evaluation on large DNN models with billions of parameters shows that Oobleck provides consistently high throughput, and it outperforms state-of-the-art fault tolerance solutions like Bamboo and Varuna by up to $29.6\times$.
    Benchmark Generation Framework with Customizable Distortions for Image Classifier Robustness. (arXiv:2310.18626v2 [cs.CV] UPDATED)
    We present a novel framework for generating adversarial benchmarks to evaluate the robustness of image classification models. Our framework allows users to customize the types of distortions to be optimally applied to images, which helps address the specific distortions relevant to their deployment. The benchmark can generate datasets at various distortion levels to assess the robustness of different image classifiers. Our results show that the adversarial samples generated by our framework with any of the image classification models, like ResNet-50, Inception-V3, and VGG-16, are effective and transferable to other models causing them to fail. These failures happen even when these models are adversarially retrained using state-of-the-art techniques, demonstrating the generalizability of our adversarial samples. We achieve competitive performance in terms of net $L_2$ distortion compared to state-of-the-art benchmark techniques on CIFAR-10 and ImageNet; however, we demonstrate our framework achieves such results with simple distortions like Gaussian noise without introducing unnatural artifacts or color bleeds. This is made possible by a model-based reinforcement learning (RL) agent and a technique that reduces a deep tree search of the image for model sensitivity to perturbations, to a one-level analysis and action. The flexibility of choosing distortions and setting classification probability thresholds for multiple classes makes our framework suitable for algorithmic audits.
    Lie Point Symmetry and Physics Informed Networks. (arXiv:2311.04293v1 [cs.LG])
    Symmetries have been leveraged to improve the generalization of neural networks through different mechanisms from data augmentation to equivariant architectures. However, despite their potential, their integration into neural solvers for partial differential equations (PDEs) remains largely unexplored. We explore the integration of PDE symmetries, known as Lie point symmetries, in a major family of neural solvers known as physics-informed neural networks (PINNs). We propose a loss function that informs the network about Lie point symmetries in the same way that PINN models try to enforce the underlying PDE through a loss function. Intuitively, our symmetry loss ensures that the infinitesimal generators of the Lie group conserve the PDE solutions. Effectively, this means that once the network learns a solution, it also learns the neighbouring solutions generated by Lie point symmetries. Empirical evaluations indicate that the inductive bias introduced by the Lie point symmetries of the PDEs greatly boosts the sample efficiency of PINNs.
    PDFTriage: Question Answering over Long, Structured Documents. (arXiv:2309.08872v2 [cs.CL] UPDATED)
    Large Language Models (LLMs) have issues with document question answering (QA) in situations where the document is unable to fit in the small context length of an LLM. To overcome this issue, most existing works focus on retrieving the relevant context from the document, representing it as plain text. However, documents such as PDFs, web pages, and presentations are naturally structured with different pages, tables, sections, and so on. Representing such structured documents as plain text is incongruous with the user's mental model of these documents with rich structure. When a system has to query the document for context, this incongruity is brought to the fore, and seemingly trivial questions can trip up the QA system. To bridge this fundamental gap in handling structured documents, we propose an approach called PDFTriage that enables models to retrieve the context based on either structure or content. Our experiments demonstrate the effectiveness of the proposed PDFTriage-augmented models across several classes of questions where existing retrieval-augmented LLMs fail. To facilitate further research on this fundamental problem, we release our benchmark dataset consisting of 900+ human-generated questions over 80 structured documents from 10 different categories of question types for document QA. Our code and datasets will be released soon on GitHub.
    Multitask Kernel-based Learning with First-Order Logic Constraints. (arXiv:2311.03340v2 [cs.LG] UPDATED)
    In this paper we propose a general framework to integrate supervised and unsupervised examples with background knowledge expressed by a collection of first-order logic (FOL) clauses into kernel machines. In particular, we consider a multi-task learning scheme where multiple predicates defined on a set of objects are to be jointly learned from examples, enforcing a set of FOL constraints on the admissible configurations of their values. The predicates are defined on the feature spaces, in which the input objects are represented, and can be either known a priori or approximated by an appropriate kernel-based learner. A general approach is presented to convert the FOL clauses into a continuous implementation that can deal with the outputs computed by the kernel-based predicates. The learning problem is formulated as a semi-supervised task that requires the optimization in the primal of a loss function that combines a fitting loss measure on the supervised examples, a regularization term, and a penalty term that enforces the constraints on both the supervised and unsupervised examples. Unfortunately, the penalty term is not convex and it can hinder the optimization process. However, it is possible to avoid poor solutions by using a two-stage learning schema, in which the supervised examples are learned first and then the constraints are enforced.
    Robust Best-arm Identification in Linear Bandits. (arXiv:2311.04731v1 [cs.LG])
    We study the robust best-arm identification problem (RBAI) in the case of linear rewards. The primary objective is to identify a near-optimal robust arm, which involves selecting arms at every round and assessing their robustness by exploring potential adversarial actions. This approach is particularly relevant when utilizing a simulator and seeking to identify a robust solution for real-world transfer. To this end, we present an instance-dependent lower bound for the robust best-arm identification problem with linear rewards. Furthermore, we propose both static and adaptive bandit algorithms that achieve sample complexity that matches the lower bound. In synthetic experiments, our algorithms effectively identify the best robust arm and perform similarly to the oracle strategy. As an application, we examine diabetes care and the process of learning insulin dose recommendations that are robust with respect to inaccuracies in standard calculators. Our algorithms prove to be effective in identifying robust dosage values across various age ranges of patients.
    Uncertainty in GNN Learning Evaluations: The Importance of a Consistent Benchmark for Community Detection. (arXiv:2305.06026v4 [cs.LG] UPDATED)
    Graph Neural Networks (GNNs) have improved unsupervised community detection of clustered nodes due to their ability to encode the dual dimensionality of the connectivity and feature information spaces of graphs. Identifying the latent communities has many practical applications from social networks to genomics. Current benchmarks of real-world performance are confusing due to the variety of decisions influencing the evaluation of GNNs at this task. To address this, we propose a framework to establish a common evaluation protocol. We motivate and justify it by demonstrating the differences with and without the protocol. The W Randomness Coefficient is a metric proposed for assessing the consistency of algorithm rankings to quantify the reliability of results under the presence of randomness. We find that by ensuring the same evaluation criteria are followed, there may be significant differences from the reported performance of methods at this task, but a more complete evaluation and comparison of methods is possible.
    Foundational propositions of hesitant fuzzy sets and parameter reductions of hesitant fuzzy information systems. (arXiv:2311.04256v1 [cs.AI])
    Hesitant fuzzy sets are widely used in instances of uncertainty and hesitation. The inclusion relationship is an important and foundational definition for sets. Hesitant fuzzy sets, as a kind of set, need an explicit definition of the inclusion relationship. Based on the hesitant fuzzy membership degree of discrete form, several kinds of inclusion relationships for hesitant fuzzy sets are proposed. Then some foundational propositions of hesitant fuzzy sets and the families of hesitant fuzzy sets are presented. Finally, some foundational propositions of hesitant fuzzy information systems with respect to parameter reductions are put forward, and an example and an algorithm are given to illustrate the processes of parameter reductions.
    Hybrid Focal and Full-Range Attention Based Graph Transformers. (arXiv:2311.04653v1 [cs.LG])
    The paradigm of Transformers using the self-attention mechanism has manifested its advantage in learning graph-structured data. Yet, while Graph Transformers are capable of modeling full-range dependencies, they are often deficient in extracting local information. A common practice is to utilize Message Passing Neural Networks (MPNNs) as an auxiliary to capture local information, which, however, are still inadequate for comprehending substructures. In this paper, we present a purely attention-based architecture, namely Focal and Full-Range Graph Transformer (FFGT), which can mitigate the loss of local information in learning global correlations. The core component of FFGT is a new mechanism of compound attention, which combines the conventional full-range attention with K-hop focal attention on ego-nets to aggregate both global and local information. Beyond the scope of canonical Transformers, the FFGT has the merit of being more substructure-aware. Our approach enhances the performance of existing Graph Transformers on various open datasets, while achieving comparable SOTA performance on several Long-Range Graph Benchmark (LRGB) datasets even with a vanilla transformer. We further examine influential factors on the optimal focal length of attention by introducing a novel synthetic dataset based on SBM-PATTERN.
    An Unsupervised Deep Learning Approach for the Wave Equation Inverse Problem. (arXiv:2311.04531v1 [math.NA])
    Full-waveform inversion (FWI) is a powerful geophysical imaging technique that infers high-resolution subsurface physical parameters by solving a non-convex optimization problem. However, due to limitations in observation, e.g., limited shots or receivers, and random noise, conventional inversion methods are confronted with numerous challenges, such as the local-minimum problem. In recent years, a substantial body of work has demonstrated that the integration of deep neural networks and partial differential equations for solving full-waveform inversion problems has shown promising performance. In this work, drawing inspiration from the expressive capacity of neural networks, we provide an unsupervised learning approach aimed at accurately reconstructing subsurface physical velocity parameters. This method is founded on a re-parametrization technique for Bayesian inference, achieved through a deep neural network with random weights. Notably, our proposed approach does not hinge upon the requirement of the labeled training dataset, rendering it exceedingly versatile and adaptable to diverse subsurface models. Extensive experiments show that the proposed approach performs noticeably better than existing conventional inversion methods.
    Learning Performance-Improving Code Edits. (arXiv:2302.07867v4 [cs.SE] UPDATED)
    With the waning of Moore's law, optimizing program performance has become a major focus of software research. However, high-level optimizations such as API and algorithm changes remain elusive due to the difficulty of understanding the semantics of code. Simultaneously, pretrained large language models (LLMs) have demonstrated strong capabilities at solving a wide range of programming tasks. To that end, we introduce a framework for adapting LLMs to high-level program optimization. First, we curate a dataset of performance-improving edits made by human programmers of over 77K competitive C++ programming submission pairs, accompanied by extensive unit tests. A major challenge is the significant variability of measuring performance on commodity hardware, which can lead to spurious "improvements". To isolate and reliably evaluate the impact of program optimizations, we design an environment based on the gem5 full system simulator, the de facto simulator used in academia and industry. Next, we propose a broad range of adaptation strategies for code optimization; for prompting, these include retrieval-based few-shot prompting and chain-of-thought, and for finetuning, these include performance-conditioned generation and synthetic data augmentation based on self-play. A combination of these techniques achieves an average speedup of 5.65X on CodeLlama-13B and 6.86X on GPT-3.5, surpassing the best human performance (4.06X). We find our proposed performance-conditioned generation is particularly effective at improving performance as well as increasing the fraction of optimized programs.
    Extending Machine Learning-Based Early Sepsis Detection to Different Demographics. (arXiv:2311.04325v1 [cs.LG])
    Sepsis requires urgent diagnosis, but research is predominantly focused on Western datasets. In this study, we perform a comparative analysis of two ensemble learning methods, LightGBM and XGBoost, using the public eICU-CRD dataset and a private dataset from St. Mary's Hospital in South Korea. Our analysis reveals the effectiveness of these methods in addressing healthcare data imbalance and enhancing sepsis detection. Specifically, LightGBM shows a slight edge in computational efficiency and scalability. The study paves the way for the broader application of machine learning in critical care, thereby expanding the reach of predictive analytics in healthcare globally.
    MarioGPT: Open-Ended Text2Level Generation through Large Language Models. (arXiv:2302.05981v3 [cs.AI] UPDATED)
    Procedural Content Generation (PCG) is a technique to generate complex and diverse environments in an automated way. However, while generating content with PCG methods is often straightforward, generating meaningful content that reflects specific intentions and constraints remains challenging. Furthermore, many PCG algorithms lack the ability to generate content in an open-ended manner. Recently, Large Language Models (LLMs) have been shown to be incredibly effective in many diverse domains. These trained LLMs can be fine-tuned, re-using information and accelerating training for new tasks. Here, we introduce MarioGPT, a fine-tuned GPT2 model trained to generate tile-based game levels, in our case Super Mario Bros levels. MarioGPT can not only generate diverse levels, but can be text-prompted for controllable level generation, addressing one of the key challenges of current PCG techniques. As far as we know, MarioGPT is the first text-to-level model, and combined with novelty search it enables the generation of diverse levels with varying play-style dynamics (i.e. player paths) and the open-ended discovery of an increasingly diverse range of content. Code available at https://github.com/shyamsn97/mario-gpt.
    Transduce and Speak: Neural Transducer for Text-to-Speech with Semantic Token Prediction. (arXiv:2311.02898v2 [eess.AS] UPDATED)
    We introduce a text-to-speech (TTS) framework based on a neural transducer. We use discretized semantic tokens acquired from wav2vec2.0 embeddings, which makes it easy to adopt a neural transducer for the TTS framework and benefit from its monotonic alignment constraints. The proposed model first generates aligned semantic tokens using the neural transducer, then synthesizes a speech sample from the semantic tokens using a non-autoregressive (NAR) speech generator. This decoupled framework alleviates the training complexity of TTS and allows each stage to focus on 1) linguistic and alignment modeling and 2) fine-grained acoustic modeling, respectively. Experimental results on the zero-shot adaptive TTS show that the proposed model exceeds the baselines in speech quality and speaker similarity via objective and subjective measures. We also investigate the inference speed and prosody controllability of our proposed model, showing the potential of the neural transducer for TTS frameworks.
    Optimal Deep Neural Network Approximation for Korobov Functions with respect to Sobolev Norms. (arXiv:2311.04779v1 [math.NA])
    This paper establishes the nearly optimal rate of approximation for deep neural networks (DNNs) when applied to Korobov functions, effectively overcoming the curse of dimensionality. The approximation results presented in this paper are measured with respect to $L_p$ norms and $H^1$ norms. Our achieved approximation rate demonstrates a remarkable "super-convergence" rate, outperforming traditional methods and any continuous function approximator. These results are non-asymptotic, providing error bounds that consider both the width and depth of the networks simultaneously.
    Exploring Predicate Visual Context in Detecting Human-Object Interactions. (arXiv:2308.06202v2 [cs.CV] UPDATED)
    Recently, the DETR framework has emerged as the dominant approach for human--object interaction (HOI) research. In particular, two-stage transformer-based HOI detectors are amongst the most performant and training-efficient approaches. However, these often condition HOI classification on object features that lack fine-grained contextual information, eschewing pose and orientation information in favour of visual cues about object identity and box extremities. This naturally hinders the recognition of complex or ambiguous interactions. In this work, we study these issues through visualisations and carefully designed experiments. Accordingly, we investigate how best to re-introduce image features via cross-attention. With an improved query design, extensive exploration of keys and values, and box pair positional embeddings as spatial guidance, our model with enhanced predicate visual context (PViC) outperforms state-of-the-art methods on the HICO-DET and V-COCO benchmarks, while maintaining low training cost.
    Learning Linear Gaussian Polytree Models with Interventions. (arXiv:2311.04636v1 [stat.ML])
    We present a consistent and highly scalable local approach to learn the causal structure of a linear Gaussian polytree using data from interventional experiments with known intervention targets. Our methods first learn the skeleton of the polytree and then orient its edges. The output is a CPDAG representing the interventional equivalence class of the polytree of the true underlying distribution. The skeleton and orientation recovery procedures we use rely on second order statistics and low-dimensional marginal distributions. We assess the performance of our methods under different scenarios in synthetic data sets and apply our algorithm to learn a polytree in a gene expression interventional data set. Our simulation studies demonstrate that our approach is fast, has good accuracy in terms of structural Hamming distance, and handles problems with thousands of nodes.
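    For readers unfamiliar with the structural Hamming distance (SHD) mentioned above, here is a small sketch of one common convention. Conventions differ between papers; this version counts every node pair whose edge status differs (missing, extra, reversed, or directed versus undirected) exactly once, and the adjacency-matrix encoding and function name are assumptions, not taken from the paper.

```python
import numpy as np

def structural_hamming_distance(adj_true, adj_est):
    """Count node pairs whose edge differs between two graphs.

    Encoding (assumed): adj[i, j] = 1 means an edge i -> j; an undirected
    CPDAG edge is stored as adj[i, j] = adj[j, i] = 1.
    """
    a = np.asarray(adj_true, dtype=bool)
    b = np.asarray(adj_est, dtype=bool)
    diff = a != b
    # A pair {i, j} is wrong if either direction disagrees; count it once.
    pair_diff = diff | diff.T
    return int(np.triu(pair_diff, k=1).sum())

# True polytree: 0 -> 1 -> 2. Estimate reverses 0 -> 1 and adds 0 -> 2.
adj_true = [[0, 1, 0], [0, 0, 1], [0, 0, 0]]
adj_est = [[0, 0, 1], [1, 0, 1], [0, 0, 0]]
shd = structural_hamming_distance(adj_true, adj_est)
```

    Here the reversed edge and the extra edge each contribute one unit, so a lower value means the estimated structure is closer to the true one.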
    From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion. (arXiv:2308.02560v2 [cs.SD] UPDATED)
    Deep generative models can generate high-fidelity audio conditioned on various types of representations (e.g., mel-spectrograms, Mel-frequency Cepstral Coefficients (MFCC)). Recently, such models have been used to synthesize audio waveforms conditioned on highly compressed representations. Although such methods produce impressive results, they are prone to generating audible artifacts when the conditioning is flawed or imperfect. An alternative modeling approach is to use diffusion models. However, these have mainly been used as speech vocoders (i.e., conditioned on mel-spectrograms) or for generating relatively low sampling rate signals. In this work, we propose a high-fidelity multi-band diffusion-based framework that generates any type of audio modality (e.g., speech, music, environmental sounds) from low-bitrate discrete representations. At equal bit rate, the proposed approach outperforms state-of-the-art generative techniques in terms of perceptual quality. Training and evaluation code, along with audio samples, are available on the facebookresearch/audiocraft GitHub page.
    Towards Open-world Cross-Domain Sequential Recommendation: A Model-Agnostic Contrastive Denoising Approach. (arXiv:2311.04760v1 [cs.IR])
    Cross-domain sequential recommendation (CDSR) aims to address the data sparsity problems that exist in traditional sequential recommendation (SR) systems. The existing approaches aim to design a specific cross-domain unit that can transfer and propagate information across multiple domains by relying on overlapping users with abundant behaviors. However, in real-world recommender systems, CDSR scenarios usually consist of a majority of long-tailed users with sparse behaviors and cold-start users who only exist in one domain. This leads to a drop in the performance of existing CDSR methods in the real-world industry platform. Therefore, improving the consistency and effectiveness of models in open-world CDSR scenarios is crucial for constructing CDSR models (1st CH). Recently, some SR approaches have utilized auxiliary behaviors to complement the information for long-tailed users. However, these multi-behavior SR methods cannot deliver promising performance in CDSR, as they overlook the semantic gap between target and auxiliary behaviors, as well as user interest deviation across domains (2nd CH).
    Object-Centric Learning with Slot Mixture Module. (arXiv:2311.04640v1 [cs.LG])
    Object-centric architectures usually apply a differentiable module to the entire feature map to decompose it into sets of entity representations called slots. Some of these methods structurally resemble clustering algorithms, where the cluster's center in latent space serves as a slot representation. Slot Attention is an example of such a method, acting as a learnable analog of the soft k-means algorithm. Our work employs a learnable clustering method based on the Gaussian Mixture Model. Unlike other approaches, we represent slots not only as centers of clusters but also incorporate information about the distance between clusters and assigned vectors, leading to more expressive slot representations. Our experiments demonstrate that using this approach instead of Slot Attention improves performance in object-centric scenarios, achieving state-of-the-art results in the set property prediction task.
    Solving High Frequency and Multi-Scale PDEs with Gaussian Processes. (arXiv:2311.04465v1 [cs.LG])
    Machine learning based solvers have garnered much attention in physical simulation and scientific computing, with a prominent example being physics-informed neural networks (PINNs). However, PINNs often struggle to solve high-frequency and multi-scale PDEs, which can be due to spectral bias during neural network training. To address this problem, we resort to the Gaussian process (GP) framework. To flexibly capture the dominant frequencies, we model the power spectrum of the PDE solution with a Student's t mixture or Gaussian mixture. We then apply the inverse Fourier transform to obtain the covariance function (according to the Wiener-Khinchin theorem). The covariance derived from the Gaussian mixture spectrum corresponds to the known spectral mixture kernel. We are the first to discover its rationale and effectiveness for PDE solving. Next, we estimate the mixture weights in the log domain, which we show is equivalent to placing a Jeffreys prior. It automatically induces sparsity, prunes excessive frequencies, and adjusts the remaining toward the ground truth. Third, to enable efficient and scalable computation on massive collocation points, which are critical to capture high frequencies, we place the collocation points on a grid, and multiply our covariance function at each input dimension. We use the GP conditional mean to predict the solution and its derivatives so as to fit the boundary condition and the equation itself. As a result, we can derive a Kronecker product structure in the covariance matrix. We use Kronecker product properties and multilinear algebra to greatly promote computational efficiency and scalability, without any low-rank approximations. We show the advantage of our method in systematic experiments.
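    As a rough illustration of two ingredients named in this abstract, the sketch below evaluates the one-dimensional spectral mixture kernel, k(tau) = sum_q w_q exp(-2 pi^2 sigma_q^2 tau^2) cos(2 pi mu_q tau), which is the inverse Fourier transform of a symmetrized Gaussian mixture spectrum, and then builds the covariance on a 2-D grid as a Kronecker product of per-dimension kernels. The parameter values, the grid sizes, and the use of the same mixture in both dimensions are assumptions for illustration only; they are not the paper's settings.

```python
import numpy as np

def spectral_mixture_kernel(x1, x2, weights, means, scales):
    """1-D spectral mixture kernel between point sets x1 and x2.

    weights : mixture weights w_q
    means   : component frequencies mu_q
    scales  : component standard deviations sigma_q (in frequency space)
    """
    tau = x1[:, None] - x2[None, :]
    k = np.zeros_like(tau)
    for w, mu, sigma in zip(weights, means, scales):
        k += w * np.exp(-2.0 * np.pi**2 * sigma**2 * tau**2) * np.cos(2.0 * np.pi * mu * tau)
    return k

# Hypothetical mixture: one low-frequency and one high-frequency component.
weights = np.array([1.0, 0.5])
means = np.array([1.0, 10.0])
scales = np.array([0.1, 0.5])

# Collocation points on a 4 x 3 grid; the product kernel across dimensions
# makes the full 12 x 12 covariance a Kronecker product of small matrices.
gx = np.linspace(0.0, 1.0, 4)
gy = np.linspace(0.0, 1.0, 3)
Kx = spectral_mixture_kernel(gx, gx, weights, means, scales)
Ky = spectral_mixture_kernel(gy, gy, weights, means, scales)
K_grid = np.kron(Kx, Ky)
```

    In practice the point of the Kronecker structure is that `K_grid` never needs to be formed explicitly; it is materialized here only to make the structure visible.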
    Towards a Unified Framework of Contrastive Learning for Disentangled Representations. (arXiv:2311.04774v1 [cs.LG])
    Contrastive learning has recently emerged as a promising approach for learning data representations that discover and disentangle the explanatory factors of the data. Previous analyses of such approaches have largely focused on individual contrastive losses, such as noise-contrastive estimation (NCE) and InfoNCE, and rely on specific assumptions about the data generating process. This paper extends the theoretical guarantees for disentanglement to a broader family of contrastive methods, while also relaxing the assumptions about the data distribution. Specifically, we prove identifiability of the true latents for four contrastive losses studied in this paper, without imposing common independence assumptions. The theoretical findings are validated on several benchmark datasets. Finally, practical limitations of these methods are also investigated.
    Identifying Semantic Component for Robust Molecular Property Prediction. (arXiv:2311.04837v1 [cs.LG])
    Although graph neural networks have achieved great success in the task of molecular property prediction in recent years, their generalization ability under out-of-distribution (OOD) settings is still under-explored. Different from existing methods that learn discriminative representations for prediction, we propose a generative model with semantic-components identifiability, named SCI. We demonstrate that the latent variables in this generative model can be explicitly identified into semantic-relevant (SR) and semantic-irrelevant (SI) components, which contributes to better OOD generalization by involving minimal change properties of causal mechanisms. Specifically, we first formulate the data generation process from the atom level to the molecular level, where the latent space is split into SI substructures, SR substructures, and SR atom variables. Subsequently, to reduce misidentification, we restrict the minimal changes of the SR atom variables and add a semantic latent substructure regularization to mitigate the variance of the SR substructure under augmented domain changes. Under mild assumptions, we prove the block-wise identifiability of the SR substructure and the component-wise identifiability of SR atom variables. Experimental studies achieve state-of-the-art performance and show general improvement on 21 datasets in 3 mainstream benchmarks. Moreover, the visualization results of the proposed SCI method provide insightful case studies and explanations for the prediction results. The code is available at: https://github.com/DMIRLAB-Group/SCI.
    Evaluating Uncertainty Quantification approaches for Neural PDEs in scientific applications. (arXiv:2311.04457v1 [cs.LG])
    The accessibility of spatially distributed data, enabled by affordable sensors, field, and numerical experiments, has facilitated the development of data-driven solutions for scientific problems, including climate change, weather prediction, and urban planning. Neural Partial Differential Equations (Neural PDEs), which combine deep learning (DL) techniques with domain expertise (e.g., governing equations) for parameterization, have proven to be effective in capturing valuable correlations within spatiotemporal datasets. However, sparse and noisy measurements coupled with modeling approximation introduce aleatoric and epistemic uncertainties. Therefore, quantifying uncertainties propagated from model inputs to outputs remains a challenge and an essential goal for establishing the trustworthiness of Neural PDEs. This work evaluates various Uncertainty Quantification (UQ) approaches for both Forward and Inverse Problems in scientific applications. Specifically, we investigate the effectiveness of Bayesian methods, such as Hamiltonian Monte Carlo (HMC) and Monte-Carlo Dropout (MCD), and a more conventional approach, Deep Ensembles (DE). To illustrate their performance, we take two canonical PDEs: Burgers' equation and the Navier-Stokes equation. Our results indicate that Neural PDEs can effectively reconstruct flow systems and predict the associated unknown parameters. However, it is noteworthy that the results derived from Bayesian methods, based on our observations, tend to display a higher degree of certainty in their predictions as compared to those obtained using the DE. This elevated certainty in predictions suggests that Bayesian techniques might underestimate the true underlying uncertainty, thereby appearing more confident in their predictions than the DE approach.
    Spectral Evolution and Invariance in Linear-width Neural Networks. (arXiv:2211.06506v2 [cs.LG] UPDATED)
    We investigate the spectral properties of linear-width feed-forward neural networks, where the sample size is asymptotically proportional to network width. Empirically, we show that the spectra of weight in this high dimensional regime are invariant when trained by gradient descent for small constant learning rates; we provide a theoretical justification for this observation and prove the invariance of the bulk spectra for both conjugate and neural tangent kernels. We demonstrate similar characteristics when training with stochastic gradient descent with small learning rates. When the learning rate is large, we exhibit the emergence of an outlier whose corresponding eigenvector is aligned with the training data structure. We also show that after adaptive gradient training, where a lower test error and feature learning emerge, both weight and kernel matrices exhibit heavy tail behavior. Simple examples are provided to explain when heavy tails can have better generalizations. We exhibit different spectral properties such as invariant bulk, spike, and heavy-tailed distribution from a two-layer neural network using different training strategies, and then correlate them to the feature learning. Analogous phenomena also appear when we train conventional neural networks with real-world data. We conclude that monitoring the evolution of the spectra during training is an essential step toward understanding the training dynamics and feature learning.
    Accurate Autism Spectrum Disorder prediction using Support Vector Classifier based on Federated Learning (SVCFL). (arXiv:2311.04606v1 [cs.LG])
    The path to an autism diagnosis can be long and difficult, and delays can have serious consequences. Artificial intelligence can completely change the way autism is diagnosed, especially when it comes to situations where it is difficult to see the first signs of the disease. AI-based diagnostic tools may help confirm a diagnosis or highlight the need for further testing by analyzing large volumes of data and uncovering patterns that may not be immediately apparent to human evaluators. After a successful and timely diagnosis, autism can be treated through artificial intelligence using various methods. In this article, by using four datasets and gathering them with the federated learning method and diagnosing them with the support vector classifier method, the early diagnosis of this disorder has been discussed. In this method, we have achieved 99% accuracy for predicting autism spectrum disorder and we have achieved 13% improvement in the results.
    Long-term Time Series Forecasting based on Decomposition and Neural Ordinary Differential Equations. (arXiv:2311.04522v1 [cs.LG])
    Long-term time series forecasting (LTSF) is a challenging task that has been investigated in various domains such as financial investment, health care, traffic, and weather forecasting. In recent years, Linear-based LTSF models showed better performance, pointing out the problem of Transformer-based approaches causing temporal information loss. However, the Linear-based approach also has a limitation: the model is too simple to comprehensively exploit the characteristics of the dataset. To address this limitation, we propose LTSF-DNODE, which applies a model based on linear ordinary differential equations (ODEs) and a time series decomposition method according to data statistical characteristics. We show that LTSF-DNODE outperforms the baselines on various real-world datasets. In addition, for each dataset, we explore the impacts of regularization in the neural ordinary differential equation (NODE) framework.
    TopP&R: Robust Support Estimation Approach for Evaluating Fidelity and Diversity in Generative Models. (arXiv:2306.08013v4 [cs.LG] UPDATED)
    We propose a robust and reliable evaluation metric for generative models by introducing topological and statistical treatments for rigorous support estimation. Existing metrics, such as Inception Score (IS), Frechet Inception Distance (FID), and the variants of Precision and Recall (P&R), heavily rely on supports that are estimated from sample features. However, the reliability of their estimation has not been seriously discussed (and overlooked) even though the quality of the evaluation entirely depends on it. In this paper, we propose Topological Precision and Recall (TopP&R, pronounced 'topper'), which provides a systematic approach to estimating supports, retaining only topologically and statistically important features with a certain level of confidence. This not only makes TopP&R strong for noisy features, but also provides statistical consistency. Our theoretical and experimental results show that TopP&R is robust to outliers and non-independent and identically distributed (Non-IID) perturbations, while accurately capturing the true trend of change in samples. To the best of our knowledge, this is the first evaluation metric focused on the robust estimation of the support and provides its statistical consistency under noise.
    Toward Rapid, Optimal, and Feasible Power Dispatch through Generalized Neural Mapping. (arXiv:2311.04838v1 [eess.SY])
    The evolution towards a more distributed and interconnected grid necessitates large-scale decision-making within strict temporal constraints. Machine learning (ML) paradigms have demonstrated significant potential in improving the efficacy of optimization processes. However, the feasibility of solutions derived from ML models continues to pose challenges. It's imperative that ML models produce solutions that are attainable and realistic within the given system constraints of power systems. To address the feasibility issue and expedite the solution search process, we propose LOOP-LC 2.0 (Learning to Optimize the Optimization Process with Linear Constraints version 2.0) as a learning-based approach for solving the power dispatch problem. A notable advantage of the LOOP-LC 2.0 framework is its ability to ensure near-optimality and strict feasibility of solutions without depending on computationally intensive post-processing procedures, thus eliminating the need for iterative processes. At the heart of the LOOP-LC 2.0 model lies the newly proposed generalized gauge map method, capable of mapping any infeasible solution to a feasible point within the linearly-constrained domain. The proposed generalized gauge map method improves the traditional gauge map by exhibiting reduced sensitivity to input variances while increasing search speeds significantly. Utilizing the IEEE-200 test case as a benchmark, we demonstrate the effectiveness of the LOOP-LC 2.0 methodology, confirming its superior performance in terms of training speed, computational time, optimality, and solution feasibility compared to existing methodologies.
    Joint control variate for faster black-box variational inference. (arXiv:2210.07290v3 [cs.LG] UPDATED)
    Black-box variational inference performance is sometimes hindered by the use of gradient estimators with high variance. This variance comes from two sources of randomness: Data subsampling and Monte Carlo sampling. While existing control variates only address Monte Carlo noise, and incremental gradient methods typically only address data subsampling, we propose a new "joint" control variate that jointly reduces variance from both sources of noise. This significantly reduces gradient variance, leading to faster optimization in several applications.
    Future Lens: Anticipating Subsequent Tokens from a Single Hidden State. (arXiv:2311.04897v1 [cs.CL])
    We conjecture that hidden state vectors corresponding to individual input tokens encode information sufficient to accurately predict several tokens ahead. More concretely, in this paper we ask: Given a hidden (internal) representation of a single token at position $t$ in an input, can we reliably anticipate the tokens that will appear at positions $\geq t + 2$? To test this, we measure linear approximation and causal intervention methods in GPT-J-6B to evaluate the degree to which individual hidden states in the network contain signal rich enough to predict future hidden states and, ultimately, token outputs. We find that, at some layers, we can approximate a model's output with more than 48% accuracy with respect to its prediction of subsequent tokens through a single hidden state. Finally we present a "Future Lens" visualization that uses these methods to create a new view of transformer states.
    Variational Classification. (arXiv:2305.10406v3 [cs.LG] UPDATED)
    We present a latent variable model for classification that provides a novel probabilistic interpretation of neural network softmax classifiers. We derive a variational objective to train the model, analogous to the evidence lower bound (ELBO) used to train variational auto-encoders, that generalises the cross-entropy loss used to train classification models. Treating inputs to the softmax layer as samples of a latent variable, our abstracted perspective reveals a potential inconsistency between their anticipated distribution, required for accurate label predictions to be output, and the empirical distribution found in practice. We augment the variational objective to mitigate such inconsistency and encourage a chosen latent distribution, instead of the implicit assumption in off-the-shelf softmax classifiers. Overall, we provide new theoretical insight into the inner workings of widely-used softmax classification. Empirical evaluation on image and text classification datasets demonstrates that our proposed approach, variational classification, maintains classification accuracy while the reshaped latent space improves other desirable properties of a classifier, such as calibration, adversarial robustness, robustness to distribution shift and sample efficiency useful in low data settings.
    HKTGNN: Hierarchical Knowledge Transferable Graph Neural Network-based Supply Chain Risk Assessment. (arXiv:2311.04244v1 [cs.LG])
    The strength of a supply chain is an important measure of a country's or region's technical advancement and overall competitiveness. Establishing supply chain risk assessment models for effective management and mitigation of potential risks has become increasingly crucial. As the number of businesses grows, the important relationships become more complicated and difficult to measure. This emphasizes the need of extracting relevant information from graph data. Previously, academics mostly employed knowledge inference to increase the visibility of links between nodes in the supply chain. However, they have not solved the data hunger problem of single node feature characteristics. We propose a hierarchical knowledge transferable graph neural network-based (HKTGNN) supply chain risk assessment model to address these issues. Our approach is based on current graph embedding methods for assessing corporate investment risk assessment. We embed the supply chain network corresponding to individual goods in the supply chain using the graph embedding module, resulting in a directed homogeneous graph with just product nodes. This reduces the complicated supply chain network into a basic product network. It addresses difficulties using the domain difference knowledge transferable module based on centrality, which is presented by the premise that supply chain feature characteristics may be biased in the actual world. Meanwhile, the feature complement and message passing will alleviate the data hunger problem, which is driven by domain differences. Our model outperforms in experiments on a real-world supply chain dataset. We will give an equation to prove that our comparative experiment is both effective and fair.
    Incorporating temporal dynamics of mutations to enhance the prediction capability of antiretroviral therapy's outcome for HIV-1. (arXiv:2311.04846v1 [cs.LG])
    Motivation: In predicting HIV therapy outcomes, a critical clinical question is whether using historical information can enhance predictive capabilities compared with current or latest available data analysis. This study analyses whether historical knowledge, which includes viral mutations detected in all genotypic tests before therapy, their temporal occurrence, and concomitant viral load measurements, can bring improvements. We introduce a method to weigh mutations, considering the previously enumerated factors and the reference mutation-drug Stanford resistance tables. We compare a model encompassing history (H) with one not using it (NH). Results: The H-model demonstrates superior discriminative ability, with a higher ROC-AUC score (76.34%) than the NH-model (74.98%). Significant Wilcoxon test results confirm that incorporating historical information consistently improves predictive accuracy for treatment outcomes. The better performance of the H-model might be attributed to its consideration of latent HIV reservoirs, probably obtained when leveraging historical information. The findings emphasize the importance of temporal dynamics in mutations, offering insights into HIV infection complexities. However, our result also shows that prediction accuracy remains relatively high even when no historical information is available. Supplementary information: Supplementary material is available.
    Challenging Common Assumptions in Multi-task Learning. (arXiv:2311.04698v1 [cs.LG])
    While multi-task learning (MTL) has gained significant attention in recent years, its underlying mechanisms remain poorly understood. Recent methods did not yield consistent performance improvements over single task learning (STL) baselines, underscoring the importance of gaining more profound insights about challenges specific to MTL. In our study, we challenge common assumptions in MTL in the context of STL: First, the choice of optimizer has only been mildly investigated in MTL. We show the pivotal role of common STL tools such as the Adam optimizer in MTL. We attribute the effectiveness of Adam to its partial loss-scale invariance. Second, the notion of gradient conflicts has often been phrased as a specific problem in MTL. We delve into the role of gradient conflicts in MTL and compare it to STL. For angular gradient alignment we find no evidence that this is a unique problem in MTL. We emphasize differences in gradient magnitude as the main distinguishing factor. Lastly, we compare the transferability of features learned through MTL and STL on common image corruptions, and find no conclusive evidence that MTL leads to superior transferability. Overall, we find surprising similarities between STL and MTL, suggesting that methods from both fields should be considered in a broader context.
    Training CLIP models on Data from Scientific Papers. (arXiv:2311.04711v1 [cs.CV])
    Contrastive Language-Image Pretraining (CLIP) models are able to capture the semantic relationship of images and texts and have enabled a wide range of applications, from image retrieval to classification. These models are trained with datasets extracted from web crawls, which are of large quantity but limited quality. This paper explores whether limited amounts of higher-quality data in a specific domain improve the general performance of CLIP models. To this purpose, we extract text-image data from scientific papers hosted in the arXiv and PubMed Central repositories. Experiments on small-scale CLIP models (ViT B/32) show that model performance increases on average, but only moderately. This result indicates that using the data sources considered in the paper to train large-scale CLIP models is a worthwhile research direction.
    Massive Editing for Large Language Models via Meta Learning. (arXiv:2311.04661v1 [cs.CL])
    While large language models (LLMs) have enabled learning knowledge from the pre-training corpora, the acquired knowledge may be fundamentally incorrect or outdated over time, which necessitates rectifying the knowledge of the language model (LM) after the training. A promising approach involves employing a hyper-network to generate parameter shift, whereas existing hyper-networks suffer from inferior scalability in synchronous editing operation amount. To mitigate the problem, we propose the MAssive Language Model Editing Network (MALMEN), which formulates the parameter shift aggregation as a least-squares problem, subsequently updating the LM parameters using the normal equation. To accommodate editing multiple facts simultaneously with limited memory budgets, we separate the computation on the hyper-network and LM, enabling arbitrary batch size on both neural networks. Our method is evaluated by editing up to thousands of facts on LMs with different architectures, i.e., BERT-base, GPT-2, T5-XL (2.8B), and GPT-J (6B), across various knowledge-intensive NLP tasks, i.e., closed book fact-checking and question answering. Remarkably, MALMEN is capable of editing hundreds of times more facts than strong baselines with the identical hyper-network architecture and outperforms editors specifically designed for GPT. Our code is available at https://github.com/ChenmienTan/malmen.
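    As a rough illustration of what "a least-squares problem solved with the normal equation" can look like for one linear layer, the sketch below finds a single weight shift that best maps a batch of layer inputs to desired output changes. It is a generic ridge-regularised example and should not be read as MALMEN's actual formulation: the function name `aggregate_shift`, the ridge term, and the shapes are assumptions made here for illustration.

```python
import numpy as np


def aggregate_shift(keys, targets, ridge=1e-6):
    """Hypothetical helper: solve min_S ||S @ keys - targets||^2 + ridge * ||S||^2.

    keys:    (d_in, n)  layer inputs for the n facts being edited
    targets: (d_out, n) desired change in the layer output for each fact
    The normal equation is S @ (keys @ keys.T + ridge * I) = targets @ keys.T.
    """
    d_in = keys.shape[0]
    gram = keys @ keys.T + ridge * np.eye(d_in)
    # gram is symmetric, so transposing the normal equation gives
    # gram @ S.T = keys @ targets.T, which is solved directly.
    return np.linalg.solve(gram, keys @ targets.T).T


rng = np.random.default_rng(0)
keys = rng.normal(size=(8, 5))      # 5 edits, 8-dimensional inputs (toy sizes)
targets = rng.normal(size=(3, 5))   # 3-dimensional output changes
shift = aggregate_shift(keys, targets)
gram = keys @ keys.T + 1e-6 * np.eye(8)
```

    Solving one linear system for all edits at once is what lets conflicting per-edit updates be reconciled in a single step instead of being summed naively.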
    Investigating Navigation Strategies in the Morris Water Maze through Deep Reinforcement Learning. (arXiv:2306.01066v2 [cs.LG] UPDATED)
    Navigation is a complex skill with a long history of research in animals and humans. In this work, we simulate the Morris Water Maze in 2D to train deep reinforcement learning agents. We perform automatic classification of navigation strategies, analyze the distribution of strategies used by artificial agents, and compare them with experimental data to show similar learning dynamics as those seen in humans and rodents. We develop environment-specific auxiliary tasks and examine factors affecting their usefulness. We suggest that the most beneficial tasks are potentially more biologically feasible for real agents to use. Lastly, we explore the development of internal representations in the activations of artificial agent neural networks. These representations resemble place cells and head-direction cells found in mouse brains, and their presence has correlation to the navigation strategies that artificial agents employ.
    Policy Space Diversity for Non-Transitive Games. (arXiv:2306.16884v2 [cs.GT] UPDATED)
    Policy-Space Response Oracles (PSRO) is an influential algorithm framework for approximating a Nash Equilibrium (NE) in multi-agent non-transitive games. Many previous studies have been trying to promote policy diversity in PSRO. A major weakness in existing diversity metrics is that a more diverse (according to their diversity metrics) population does not necessarily mean (as we proved in the paper) a better approximation to a NE. To alleviate this problem, we propose a new diversity metric, the improvement of which guarantees a better approximation to a NE. Meanwhile, we develop a practical and well-justified method to optimize our diversity metric using only state-action samples. By incorporating our diversity regularization into the best response solving in PSRO, we obtain a new PSRO variant, Policy Space Diversity PSRO (PSD-PSRO). We present the convergence property of PSD-PSRO. Empirically, extensive experiments on various games demonstrate that PSD-PSRO is more effective in producing significantly less exploitable policies than state-of-the-art PSRO variants.
    Optimized measurements of chaotic dynamical systems via the information bottleneck. (arXiv:2311.04896v1 [cs.LG])
    Deterministic chaos permits a precise notion of a "perfect measurement" as one that, when obtained repeatedly, captures all of the information created by the system's evolution with minimal redundancy. Finding an optimal measurement is challenging, and has generally required intimate knowledge of the dynamics in the few cases where it has been done. We establish an equivalence between a perfect measurement and a variant of the information bottleneck. As a consequence, we can employ machine learning to optimize measurement processes that efficiently extract information from trajectory data. We obtain approximately optimal measurements for multiple chaotic maps and lay the necessary groundwork for efficient information extraction from general time series.
    Euclidean, Projective, Conformal: Choosing a Geometric Algebra for Equivariant Transformers. (arXiv:2311.04744v1 [cs.LG])
    The Geometric Algebra Transformer (GATr) is a versatile architecture for geometric deep learning based on projective geometric algebra. We generalize this architecture into a blueprint that allows one to construct a scalable transformer architecture given any geometric (or Clifford) algebra. We study versions of this architecture for Euclidean, projective, and conformal algebras, all of which are suited to represent 3D data, and evaluate them in theory and practice. The simplest Euclidean architecture is computationally cheap, but has a smaller symmetry group and is not as sample-efficient, while the projective model is not sufficiently expressive. Both the conformal algebra and an improved version of the projective algebra define powerful, performant architectures.
    MultiFusion: Fusing Pre-Trained Models for Multi-Lingual, Multi-Modal Image Generation. (arXiv:2305.15296v2 [cs.CV] UPDATED)
    The recent popularity of text-to-image diffusion models (DM) can largely be attributed to the intuitive interface they provide to users. The intended generation can be expressed in natural language, with the model producing faithful interpretations of text prompts. However, expressing complex or nuanced ideas in text alone can be difficult. To ease image generation, we propose MultiFusion that allows one to express complex and nuanced concepts with arbitrarily interleaved inputs of multiple modalities and languages. MultiFusion leverages pre-trained models and aligns them for integration into a cohesive system, thereby avoiding the need for extensive training from scratch. Our experimental results demonstrate the efficient transfer of capabilities from individual modules to the downstream model. Specifically, the fusion of all independent components allows the image generation module to utilize multilingual, interleaved multimodal inputs despite being trained solely on monomodal data in a single language.
    Debiasing Scores and Prompts of 2D Diffusion for View-consistent Text-to-3D Generation. (arXiv:2303.15413v4 [cs.CV] UPDATED)
    Existing score-distilling text-to-3D generation techniques, despite their considerable promise, often encounter the view inconsistency problem. One of the most notable issues is the Janus problem, where the most canonical view of an object (e.g., face or head) appears in other views. In this work, we explore existing frameworks for score-distilling text-to-3D generation and identify the main causes of the view inconsistency problem -- the embedded bias of 2D diffusion models. Based on these findings, we propose two approaches to debias the score-distillation frameworks for view-consistent text-to-3D generation. Our first approach, called score debiasing, involves cutting off the score estimated by 2D diffusion models and gradually increasing the truncation value throughout the optimization process. Our second approach, called prompt debiasing, identifies conflicting words between user prompts and view prompts using a language model, and adjusts the discrepancy between view prompts and the viewing direction of an object. Our experimental results show that our methods improve the realism of the generated 3D objects by significantly reducing artifacts and achieve a good trade-off between faithfulness to the 2D diffusion models and 3D consistency with little overhead. Our project page is available at https://susunghong.github.io/Debiased-Score-Distillation-Sampling/.
    Data Factors for Better Compositional Generalization. (arXiv:2311.04420v1 [cs.CL])
    Recent diagnostic datasets on compositional generalization, such as SCAN (Lake and Baroni, 2018) and COGS (Kim and Linzen, 2020), expose severe problems in models trained from scratch on these datasets. However, in contrast to this poor performance, state-of-the-art models trained on larger and more general datasets show better generalization ability. In this work, to reconcile this inconsistency, we conduct an empirical analysis by training Transformer models on a variety of training sets with different data factors, including dataset scale, pattern complexity, example difficulty, etc. First, we show that increased dataset complexity can lead to better generalization behavior on multiple different generalization challenges. To further understand this improvement, we show two axes of the benefit from more complex datasets: they provide more diverse examples so compositional understanding becomes more effective, and they also prevent ungeneralizable memorization of the examples due to reduced example repetition frequency. Finally, we explore how training examples of different difficulty levels influence generalization differently. On synthetic datasets, simple examples invoke stronger compositionality than hard examples do. On larger-scale real language datasets, while hard examples become more important potentially to ensure decent data coverage, a balanced mixture of simple and hard examples manages to induce the strongest generalizability. The code and data for this work are available at https://github.com/owenzx/data4comp
    Evading Watermark based Detection of AI-Generated Content. (arXiv:2305.03807v5 [cs.LG] UPDATED)
    A generative AI model can generate extremely realistic-looking content, posing growing challenges to the authenticity of information. To address the challenges, watermarking has been leveraged to detect AI-generated content. Specifically, a watermark is embedded into AI-generated content before it is released. Content is detected as AI-generated if a similar watermark can be decoded from it. In this work, we perform a systematic study on the robustness of such watermark-based AI-generated content detection. We focus on AI-generated images. Our work shows that an attacker can post-process a watermarked image via adding a small, human-imperceptible perturbation to it, such that the post-processed image evades detection while maintaining its visual quality. We show the effectiveness of our attack both theoretically and empirically. Moreover, to evade detection, our adversarial post-processing method adds much smaller perturbations to AI-generated images and thus better maintains their visual quality than existing popular post-processing methods such as JPEG compression, Gaussian blur, and Brightness/Contrast. Our work shows the insufficiency of existing watermark-based detection of AI-generated content, highlighting the urgent need for new methods. Our code is publicly available: https://github.com/zhengyuan-jiang/WEvade.
    ETDPC: A Multimodality Framework for Classifying Pages in Electronic Theses and Dissertations. (arXiv:2311.04262v1 [cs.CV])
    Electronic theses and dissertations (ETDs) have been proposed, advocated, and generated for more than 25 years. Although ETDs are hosted by commercial or institutional digital library repositories, they are still an understudied type of scholarly big data, partially because they are usually longer than conference proceedings and journals. Segmenting ETDs will allow researchers to study sectional content. Readers can navigate to particular pages of interest, discover, and explore the content buried in these long documents. Most existing frameworks on document page classification are designed for classifying general documents and perform poorly on ETDs. In this paper, we propose ETDPC. Its backbone is a two-stream multimodal model with a cross-attention network to classify ETD pages into 13 categories. To overcome the challenge of imbalanced labeled samples, we augmented data for minority categories and employed a hierarchical classifier. ETDPC outperforms the state-of-the-art models in all categories, achieving an F1 of 0.84 -- 0.96 for 9 out of 13 categories. We also demonstrated its data efficiency. The code and data can be found on GitHub (https://github.com/lamps-lab/ETDMiner/tree/master/etd_segmentation).
    Interpretability, Generalizability, and Memory of Reinforcement Learning Agents in Closed Drafting Games. (arXiv:2310.20654v2 [cs.LG] UPDATED)
    Closed drafting or "pick and pass" is a popular game mechanic where each round players select a card or other playable element from their hand and pass the rest to the next player. In this paper, we establish first-principle interpretability, generalizability, and memory benchmarks for studying model-free reinforcement learning (RL) algorithms playing closed drafting games. Specifically, we study a popular family of closed drafting games called "Sushi Go Party!", in which we achieve state-of-the-art performance. We fit decision rules to interpret the strategy of trained RL agents and compare these to the ranking preferences of different types of human players, finding easily understandable explanations of the disparate performance of RL agents in this environment. As Sushi Go Party! can be expressed as a set of closely related games based on the set of cards in play, we quantify the generalizability of RL models trained on various sets of cards, establishing key trends between performance and the set distance between the train and evaluation game configurations. Using the explicitly calculable memory of other players' hands in closed drafting games, we create measures of the ability of RL models to learn memory.
    RoFormer: Enhanced Transformer with Rotary Position Embedding. (arXiv:2104.09864v5 [cs.CL] UPDATED)
    Position encoding has recently proven effective in the transformer architecture. It enables valuable supervision for dependency modeling between elements at different positions of the sequence. In this paper, we first investigate various methods to integrate positional information into the learning process of transformer-based language models. Then, we propose a novel method named Rotary Position Embedding (RoPE) to effectively leverage the positional information. Specifically, the proposed RoPE encodes the absolute position with a rotation matrix and meanwhile incorporates the explicit relative position dependency in the self-attention formulation. Notably, RoPE enables valuable properties, including the flexibility of sequence length, decaying inter-token dependency with increasing relative distances, and the capability of equipping linear self-attention with relative position encoding. Finally, we evaluate the enhanced transformer with rotary position embedding, also called RoFormer, on various long text classification benchmark datasets. Our experiments show that it consistently outperforms its alternatives. Furthermore, we provide a theoretical analysis to explain some experimental results. RoFormer is already integrated into Huggingface: https://huggingface.co/docs/transformers/model_doc/roformer.
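    The rotation idea described above can be illustrated with a minimal NumPy sketch. It is not the RoFormer implementation: the function name `rope`, the pairing of even and odd features, and the frequency base of 10000 are assumptions made for illustration. The sketch checks that the attention score between a rotated query and key depends only on their relative offset.

```python
import numpy as np

def rope(x, position, base=10000.0):
    """Rotate consecutive feature pairs of x by position-dependent angles.

    Assumed layout: features (0, 1), (2, 3), ... form 2-D pairs, and pair i
    is rotated by the angle position * base**(-2i/d).
    """
    d = x.shape[-1]
    assert d % 2 == 0, "feature dimension must be even"
    freqs = base ** (-np.arange(0, d, 2) / d)   # one frequency per pair
    angles = position * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin  # standard 2-D rotation
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

# Both pairs of positions have the same relative offset (3), so the
# dot-product scores should agree up to floating-point error.
score_a = rope(q, 5) @ rope(k, 2)
score_b = rope(q, 13) @ rope(k, 10)
print(np.isclose(score_a, score_b))
```

    Because each pair is only rotated, the norm of the vector is unchanged; position information enters purely through the angle between query and key.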
    Deep Learning Assisted Multiuser MIMO Load Modulated Systems for Enhanced Downlink mmWave Communications. (arXiv:2311.04537v1 [eess.SP])
    This paper is focused on multiuser load modulation arrays (MU-LMAs) which are attractive due to their low system complexity and reduced cost for millimeter wave (mmWave) multi-input multi-output (MIMO) systems. The existing precoding algorithm for downlink MU-LMA relies on a sub-array structured (SAS) transmitter which may suffer from decreased degrees of freedom and complex system configuration. Furthermore, a conventional LMA codebook with codewords uniformly distributed on a hypersphere may not be channel-adaptive and may lead to increased signal detection complexity. In this paper, we conceive an MU-LMA system employing a full-array structured (FAS) transmitter and propose two algorithms accordingly. The proposed FAS-based system addresses the SAS structural problems and can support larger numbers of users. For LMA-imposed constant-power downlink precoding, we propose an FAS-based normalized block diagonalization (FAS-NBD) algorithm. However, the forced normalization may result in performance degradation. This degradation, together with the aforementioned codebook design problems, is difficult to solve analytically. This motivates us to propose a Deep Learning-enhanced (FAS-DL-NBD) algorithm for adaptive codebook design and codebook-independent decoding. It is shown that the proposed algorithms are robust to imperfect knowledge of channel state information and yield excellent error performance. Moreover, the FAS-DL-NBD algorithm enables signal detection with low complexity as the number of bits per codeword increases.
    A Hierarchical Spatial Transformer for Massive Point Samples in Continuous Space. (arXiv:2311.04434v1 [cs.LG])
    Transformers are widely used deep learning architectures. Existing transformers are mostly designed for sequences (texts or time series), images or videos, and graphs. This paper proposes a novel transformer model for massive (up to a million) point samples in continuous space. Such data are ubiquitous in environment sciences (e.g., sensor observations), numerical simulations (e.g., particle-laden flow, astrophysics), and location-based services (e.g., POIs and trajectories). However, designing a transformer for massive spatial points is non-trivial due to several challenges, including implicit long-range and multi-scale dependency on irregular points in continuous space, a non-uniform point distribution, the potential high computational costs of calculating all-pair attention across massive points, and the risks of over-confident predictions due to varying point density. To address these challenges, we propose a new hierarchical spatial transformer model, which includes multi-resolution representation learning within a quad-tree hierarchy and efficient spatial attention via coarse approximation. We also design an uncertainty quantification branch to estimate prediction confidence related to input feature noise and point sparsity. We provide a theoretical analysis of computational time complexity and memory costs. Extensive experiments on both real-world and synthetic datasets show that our method outperforms multiple baselines in prediction accuracy and our model can scale up to one million points on one NVIDIA A100 GPU. The code is available at https://github.com/spatialdatasciencegroup/HST.  ( 2 min )
    InstrumentGen: Generating Sample-Based Musical Instruments From Text. (arXiv:2311.04339v1 [eess.AS])
    We introduce the text-to-instrument task, which aims at generating sample-based musical instruments based on textual prompts. Accordingly, we propose InstrumentGen, a model that extends a text-prompted generative audio framework to condition on instrument family, source type, pitch (across an 88-key spectrum), velocity, and a joint text/audio embedding. Furthermore, we present a differentiable loss function to evaluate the intra-instrument timbral consistency of sample-based instruments. Our results establish a foundational text-to-instrument baseline, extending research in the domain of automatic sample-based instrument generation.  ( 2 min )
    SaFL: Sybil-aware Federated Learning with Application to Face Recognition. (arXiv:2311.04346v1 [cs.CV])
    Federated Learning (FL) is a machine learning paradigm for collaborative learning among clients on a joint model. The primary goal is to share clients' local training parameters with an integrating server while preserving their privacy. This method makes it possible to exploit the potential of massive mobile users' data to improve machine learning models' performance while keeping sensitive data on local devices. On the downside, FL raises security and privacy concerns that have only just started to be studied. To address some of the key threats in FL, researchers have proposed using secure aggregation methods (e.g. homomorphic encryption, secure multiparty computation, etc.). These solutions improve some security and privacy metrics, but at the same time bring about other serious threats such as poisoning attacks, backdoor attacks, and free running attacks. This paper proposes a new defense method against poisoning attacks in FL called SaFL (Sybil-aware Federated Learning) that minimizes the effect of sybils with a novel time-variant aggregation scheme.  ( 2 min )
    Class-Incremental Continual Learning for General Purpose Healthcare Models. (arXiv:2311.04301v1 [cs.LG])
    Healthcare clinics regularly encounter dynamic data that changes due to variations in patient populations, treatment policies, medical devices, and emerging disease patterns. Deep learning models can suffer from catastrophic forgetting when fine-tuned in such scenarios, causing poor performance on previously learned tasks. Continual learning allows learning on new tasks without performance drop on previous tasks. In this work, we investigate the performance of continual learning models on four different medical imaging scenarios involving ten classification datasets from diverse modalities, clinical specialties, and hospitals. We implement various continual learning approaches and evaluate their performance in these scenarios. Our results demonstrate that a single model can sequentially learn new tasks from different specialties and achieve comparable performance to naive methods. These findings indicate the feasibility of recycling or sharing models across the same or different medical specialties, offering another step towards the development of general-purpose medical imaging AI that can be shared across institutions.  ( 2 min )
    Holistic Evaluation of Text-To-Image Models. (arXiv:2311.04287v1 [cs.CV])
    The stunning qualitative improvement of recent text-to-image models has led to their widespread attention and adoption. However, we lack a comprehensive quantitative understanding of their capabilities and risks. To fill this gap, we introduce a new benchmark, Holistic Evaluation of Text-to-Image Models (HEIM). Whereas previous evaluations focus mostly on text-image alignment and image quality, we identify 12 aspects, including text-image alignment, image quality, aesthetics, originality, reasoning, knowledge, bias, toxicity, fairness, robustness, multilinguality, and efficiency. We curate 62 scenarios encompassing these aspects and evaluate 26 state-of-the-art text-to-image models on this benchmark. Our results reveal that no single model excels in all aspects, with different models demonstrating different strengths. We release the generated images and human evaluation results for full transparency at https://crfm.stanford.edu/heim/v1.1.0 and the code at https://github.com/stanford-crfm/helm, which is integrated with the HELM codebase.  ( 2 min )
    Everything of Thoughts: Defying the Law of Penrose Triangle for Thought Generation. (arXiv:2311.04254v1 [cs.AI])
    Recent advancements in Large Language Models (LLMs) have revolutionized decision-making by breaking down complex problems into more manageable language sequences referred to as "thoughts". An effective thought design should consider three key perspectives: performance, efficiency, and flexibility. However, existing thought paradigms can exhibit at most two of these attributes. To address these limitations, we introduce a novel thought prompting approach called "Everything of Thoughts" (XoT) to defy the law of the "Penrose triangle" of existing thought paradigms. XoT leverages pretrained reinforcement learning and Monte Carlo Tree Search (MCTS) to incorporate external domain knowledge into thoughts, thereby enhancing LLMs' capabilities and enabling them to generalize to unseen problems efficiently. Through the utilization of the MCTS-LLM collaborative thought revision framework, this approach autonomously produces high-quality comprehensive cognitive mappings with minimal LLM interactions. Additionally, XoT empowers LLMs to engage in unconstrained thinking, allowing for flexible cognitive mappings for problems with multiple solutions.  ( 2 min )
    MixtureGrowth: Growing Neural Networks by Recombining Learned Parameters. (arXiv:2311.04251v1 [cs.LG])
    Most deep neural networks are trained under fixed network architectures and require retraining when the architecture changes. If expanding the network's size is needed, it is necessary to retrain from scratch, which is expensive. To avoid this, one can grow from a small network by adding random weights over time to gradually achieve the target network size. However, this naive approach falls short in practice as it brings too much noise to the growing process. Prior work tackled this issue by leveraging the already learned weights and training data for generating new weights through conducting a computationally expensive analysis step. In this paper, we introduce MixtureGrowth, a new approach to growing networks that circumvents the initialization overhead in prior work. Before growing, each layer in our model is generated with a linear combination of parameter templates. Newly grown layer weights are generated by using a new linear combination of existing templates for a layer. On one hand, these templates are already trained for the task, providing a strong initialization. On the other, the new coefficients provide flexibility for the added layer weights to learn something new. We show that our approach boosts top-1 accuracy over the state-of-the-art by 2-2.5% on CIFAR-100 and ImageNet datasets, while achieving comparable performance with fewer FLOPs to a larger network trained from scratch. Code is available at https://github.com/chaudatascience/mixturegrowth.  ( 3 min )
    Zeroth-order Asynchronous Learning with Bounded Delays with a Use-case in Resource Allocation in Communication Networks. (arXiv:2311.04604v1 [eess.SP])
    Distributed optimization has experienced a significant surge in interest due to its wide-ranging applications in distributed learning and adaptation. While various scenarios, such as shared-memory, local-memory, and consensus-based approaches, have been extensively studied in isolation, there remains a need for further exploration of their interconnections. This paper specifically concentrates on a scenario where agents collaborate toward a unified mission while potentially having distinct tasks. Each agent's actions can potentially impact other agents through interactions. Within this context, the objective for the agents is to optimize their local parameters based on the aggregate of local reward functions, where only local zeroth-order oracles are available. Notably, the learning process is asynchronous, meaning that agents update and query their zeroth-order oracles asynchronously while communicating with other agents subject to bounded but possibly random communication delays. This paper presents theoretical convergence analyses and establishes a convergence rate for the proposed approach. Furthermore, it addresses the relevant issue of deep learning-based resource allocation in communication networks and conducts numerical experiments in which agents, acting as transmitters, collaboratively train their individual (possibly unique) policies to maximize a common performance metric.  ( 2 min )
    Adaptive Mirror Descent Bilevel Optimization. (arXiv:2311.04520v1 [math.OC])
    In this paper, we propose a class of efficient adaptive bilevel methods based on mirror descent for nonconvex bilevel optimization, where the upper-level problem is nonconvex, possibly with nonsmooth regularization, and the lower-level problem is also nonconvex while satisfying the Polyak-Łojasiewicz (PL) condition. To solve the deterministic bilevel problems, we present an efficient adaptive projection-aid gradient (i.e., AdaPAG) method based on mirror descent, and prove that it obtains the best known gradient complexity of $O(\epsilon^{-1})$ for finding an $\epsilon$-stationary solution of nonconvex bilevel problems. To solve the stochastic bilevel problems, we propose an efficient adaptive stochastic projection-aid gradient (i.e., AdaVSPAG) method based on mirror descent and variance-reduction techniques, and prove that it obtains the best known gradient complexity of $O(\epsilon^{-3/2})$ for finding an $\epsilon$-stationary solution. Since the PL condition relaxes strong convexity, our algorithms can also be applied to nonconvex-strongly-convex bilevel optimization. Theoretically, we provide a useful convergence analysis framework for our methods under some mild conditions, and prove that our methods have a fast convergence rate of $O(\frac{1}{T})$, where $T$ denotes the number of iterations.  ( 2 min )
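    For reference, the PL condition mentioned above is commonly written as below. The symbols are notation assumed here, not taken from the abstract: $g(x, y)$ is the lower-level objective, $x$ the upper-level variable, $y$ the lower-level variable, and $\mu > 0$ the PL constant.

```latex
\[
\frac{1}{2}\,\bigl\|\nabla_y g(x, y)\bigr\|^2
\;\ge\;
\mu \Bigl( g(x, y) - \min_{y'} g(x, y') \Bigr)
\quad \text{for all } y, \qquad \mu > 0 .
\]
```

    A function that is $\mu$-strongly convex in $y$ satisfies this inequality with the same $\mu$, but the inequality does not require convexity, which is the sense in which the PL condition is a relaxation.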
    Compilation of product-formula Hamiltonian simulation via reinforcement learning. (arXiv:2311.04285v1 [quant-ph])
    Hamiltonian simulation is believed to be one of the first tasks where quantum computers can yield a quantum advantage. One of the most popular methods of Hamiltonian simulation is Trotterization, which makes use of the approximation $e^{i\sum_jA_j}\sim \prod_je^{iA_j}$ and higher-order corrections thereto. However, this leaves open the question of the order of operations (i.e. the order of the product over $j$, which is known to affect the quality of approximation). In some cases this order is fixed by the desire to minimise the error of approximation; when it is not the case, we propose that the order can be chosen to optimize compilation to a native quantum architecture. This presents a new compilation problem -- order-agnostic quantum circuit compilation -- which we prove is NP-hard in the worst case. In lieu of an easily-computable exact solution, we turn to methods of heuristic optimization of compilation. We focus on reinforcement learning due to the sequential nature of the compilation task, comparing it to simulated annealing and Monte Carlo tree search. While two of the methods outperform a naive heuristic, reinforcement learning clearly outperforms all others, with a gain of around 12% with respect to the second-best method and of around 50% compared to the naive heuristic in terms of the gate count. We further test the ability of RL to generalize across instances of the compilation problem, and find that a single learner is able to solve entire problem families. This demonstrates the ability of machine learning techniques to provide assistance in an order-agnostic quantum compilation task.  ( 3 min )
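    The product-formula approximation and its dependence on term order can be checked numerically on a toy example. The sketch below is only an illustration under stated assumptions: the two single-qubit terms `0.7 * X` and `0.4 * Z` and the helper names `expi`, `trotter`, and `error` are hypothetical choices, and only the first-order formula is shown.

```python
import numpy as np

def expi(h):
    """Return exp(i*h) for a Hermitian matrix h via its eigendecomposition."""
    w, v = np.linalg.eigh(h)
    return (v * np.exp(1j * w)) @ v.conj().T

# Two non-commuting Hermitian terms (hypothetical toy Hamiltonian).
X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
terms = [0.7 * X, 0.4 * Z]

exact = expi(sum(terms))

def trotter(terms, steps):
    """First-order product formula: (prod_j exp(i*A_j/steps))**steps."""
    step = np.eye(2, dtype=complex)
    for a in terms:
        step = step @ expi(a / steps)
    return np.linalg.matrix_power(step, steps)

def error(steps):
    """Spectral-norm distance between the product formula and exp(i*sum_j A_j)."""
    return np.linalg.norm(trotter(terms, steps) - exact, 2)

for n in (1, 10, 100):
    print(n, error(n))
```

    The error shrinks as the number of steps grows, and `trotter(terms, 1)` differs from `trotter(terms[::-1], 1)`, which is the ordering freedom the abstract proposes to exploit during compilation.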
    Towards Democratizing AI: A Comparative Analysis of AI as a Service Platforms and the Open Space for Machine Learning Approach. (arXiv:2311.04518v1 [cs.LG])
    Recent AI research has significantly reduced the barriers to applying AI, but the process of setting up the necessary tools and frameworks can still be a challenge. While AI-as-a-Service platforms have emerged to simplify the training and deployment of AI models, they still fall short of achieving true democratization of AI. In this paper, we aim to address this gap by comparing several popular AI-as-a-Service platforms and identifying the key requirements for a platform that can achieve true democratization of AI. Our analysis highlights the need for self-hosting options, high scalability, and openness. To address these requirements, we propose our approach: the "Open Space for Machine Learning" platform. Our platform is built on cutting-edge technologies such as Kubernetes, Kubeflow Pipelines, and Ludwig, enabling us to overcome the challenges of democratizing AI. We argue that our approach is more comprehensive and effective in meeting the requirements of democratizing AI than existing AI-as-a-Service platforms.  ( 2 min )
    CNN-Based Structural Damage Detection using Time-Series Sensor Data. (arXiv:2311.04252v1 [cs.LG])
    Structural Health Monitoring (SHM) is vital for evaluating structural condition, aiming to detect damage through sensor data analysis. It aligns with predictive maintenance in modern industry, minimizing downtime and costs by addressing potential structural issues. Various machine learning techniques have been used to extract valuable information from vibration data, often relying on prior structural knowledge. This research introduces an innovative approach to structural damage detection, utilizing a new Convolutional Neural Network (CNN) algorithm. In order to extract deep spatial features from time series data, CNNs are taught to recognize long-term temporal connections. This methodology combines spatial and temporal features, enhancing discrimination capabilities when compared to methods solely reliant on deep spatial features. Time series data are divided into two categories using the proposed neural network: undamaged and damaged. To validate its efficacy, the method's accuracy was tested using a benchmark dataset derived from a three-floor structure at Los Alamos National Laboratory (LANL). The outcomes show that the new CNN algorithm is very accurate in spotting structural degradation in the examined structure.  ( 2 min )
    Regression with Cost-based Rejection. (arXiv:2311.04550v1 [cs.LG])
    Learning with rejection is an important framework in which a model can refrain from making predictions to avoid critical mispredictions, balancing prediction against rejection. Previous studies on cost-based rejection only focused on the classification setting, which cannot handle the continuous and infinite target space in the regression setting. In this paper, we investigate a novel regression problem called regression with cost-based rejection, where the model can decline to make predictions on some examples given certain rejection costs. To solve this problem, we first formulate the expected risk for this problem and then derive the Bayes optimal solution, which shows that the optimal model should decline to make predictions on the examples whose variance is larger than the rejection cost when the mean squared error is used as the evaluation metric. Furthermore, we propose to train the model by a surrogate loss function that considers rejection as binary classification and we provide conditions for the model consistency, which implies that the Bayes optimal solution can be recovered by our proposed surrogate loss. Extensive experiments demonstrate the effectiveness of our proposed method.  ( 2 min )
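    The Bayes optimal rule stated above (under mean squared error, reject when the conditional variance exceeds the rejection cost) can be sketched as follows. This is not the paper's surrogate-loss training procedure; it assumes conditional mean and variance estimates are already available, and the function name `predict_or_reject` and the example numbers are hypothetical.

```python
import numpy as np

def predict_or_reject(mean, variance, cost):
    """Apply the variance-threshold rejection rule to per-example estimates.

    mean, variance: estimated conditional mean and variance of the target.
    cost: fixed price paid for rejecting an example.
    Returns (prediction, reject_mask, expected_risk); rejected examples get
    NaN as their prediction.
    """
    mean = np.asarray(mean, dtype=float)
    variance = np.asarray(variance, dtype=float)
    reject = variance > cost
    prediction = np.where(reject, np.nan, mean)
    # Predicting the mean incurs expected squared error equal to the variance;
    # rejecting incurs the cost. The rule picks the smaller of the two.
    expected_risk = np.where(reject, cost, variance)
    return prediction, reject, expected_risk

pred, rej, risk = predict_or_reject([1.0, 2.0, 3.0], [0.1, 0.9, 0.4], cost=0.5)
print(rej)   # only the second example has variance above the cost
print(risk)
```

    Per example, the resulting expected risk equals the minimum of the variance and the cost, which is why thresholding the variance at the cost is optimal in this setting.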
    Graph Neural Networks for Topological Feature Extraction in ECG Classification. (arXiv:2311.04228v1 [eess.SP])
    The electrocardiogram (ECG) is a dependable instrument for assessing the function of the cardiovascular system. There has recently been much emphasis on precisely classifying ECGs. While ECG situations have numerous similarities, little attention has been paid to categorizing ECGs using graph neural networks. In this study, we offer three distinct techniques for accurately classifying heartbeats in ECG signals using deep graph neural networks. We suggest using different methods to extract topological features from the ECG signal and then using a branch of the graph neural network named graph isomorphism network for classifying the ECGs. We tested the three proposed techniques on the PTB Diagnostics data set. According to the findings, the three proposed techniques are capable of making arrhythmia classification predictions with accuracies of 99.38, 98.76, and 91.93 percent, respectively.  ( 2 min )
    Compressive Recovery of Sparse Precision Matrices. (arXiv:2311.04673v1 [stat.ML])
    We consider the problem of learning a graph modeling the statistical relations of the $d$ variables of a dataset with $n$ samples $X \in \mathbb{R}^{n \times d}$. Standard approaches amount to searching for a precision matrix $\Theta$ representative of a Gaussian graphical model that adequately explains the data. However, most maximum likelihood-based estimators usually require storing the $d^{2}$ values of the empirical covariance matrix, which can become prohibitive in a high-dimensional setting. In this work, we adopt a compressive viewpoint and aim to estimate a sparse $\Theta$ from a sketch of the data, i.e. a low-dimensional vector of size $m \ll d^{2}$ carefully designed from $X$ using nonlinear random features. Under certain assumptions on the spectrum of $\Theta$ (or its condition number), we show that it is possible to estimate it from a sketch of size $m=\Omega((d+2k)\log(d))$ where $k$ is the maximal number of edges of the underlying graph. These information-theoretic guarantees are inspired by compressed sensing theory and involve restricted isometry properties and instance optimal decoders. We investigate the possibility of achieving practical recovery with an iterative algorithm based on the graphical lasso, viewed as a specific denoiser. We compare our approach and graphical lasso on synthetic datasets, demonstrating its favorable performance even when the dataset is compressed.  ( 2 min )
    Autonomous Advanced Aerial Mobility -- An End-to-end Autonomy Framework for UAVs and Beyond. (arXiv:2311.04472v1 [cs.RO])
    Developing aerial robots that can both safely navigate and execute assigned missions without any human intervention - i.e., fully autonomous aerial mobility of passengers and goods - is the larger vision that guides the research, design, and development efforts in the aerial autonomy space. However, it is highly challenging to concurrently operationalize all types of aerial vehicles operating fully autonomously while sharing the airspace. Full autonomy of the aerial transportation sector includes several aspects, such as the design of the technology that powers the vehicles, the operation of multi-agent fleets, and a certification process that meets the stringent safety requirements of the aviation sector. As a result, Autonomous Advanced Aerial Mobility is still a vague term and its consequences for researchers and professionals are ambiguous. To address this gap, we present a comprehensive perspective on the emerging field of autonomous advanced aerial mobility, which involves the use of unmanned aerial vehicles (UAVs) and electric vertical takeoff and landing (eVTOL) aircraft for various applications, such as urban air mobility, package delivery, and surveillance. The article proposes a scalable and extensible autonomy framework consisting of four main blocks: sensing, perception, planning, and controls. Furthermore, the article discusses the challenges and opportunities in multi-agent fleet operations and management, as well as the testing, validation, and certification aspects of autonomous aerial systems. Finally, the article explores the potential of monolithic models for aerial autonomy and analyzes their advantages and limitations. The perspective aims to provide a holistic picture of the autonomous advanced aerial mobility field and its future directions.  ( 3 min )
    Predicting Market Value in Professional Soccer: Insights from Explainable Machine Learning Models. (arXiv:2311.04599v1 [cs.LG])
    This study presents an innovative method for predicting the market value of professional soccer players using explainable machine learning models. Using a dataset curated from the FIFA website, we employ an ensemble machine learning approach coupled with Shapley Additive exPlanations (SHAP) to provide detailed explanations of the models' predictions. The GBDT model achieves the highest mean R-Squared (0.8780) and the lowest mean Root Mean Squared Error (3,221,632.175), indicating its superior performance among the evaluated models. Our analysis reveals that specific skills such as ball control, short passing, finishing, interceptions, dribbling, and tackling are paramount within the skill dimension, whereas sprint speed and acceleration are critical in the fitness dimension, and reactions are preeminent in the cognitive dimension. Our results offer a more accurate, objective, and consistent framework for market value estimation, presenting useful insights for managerial decisions in player transfers.  ( 2 min )
    Leveraging sinusoidal representation networks to predict fMRI signals from EEG. (arXiv:2311.04234v1 [eess.SP])
    In modern neuroscience, functional magnetic resonance imaging (fMRI) has been a crucial and irreplaceable tool that provides a non-invasive window into the dynamics of whole-brain activity. Nevertheless, fMRI is limited by hemodynamic blurring as well as high cost, immobility, and incompatibility with metal implants. Electroencephalography (EEG) is complementary to fMRI and can directly record the cortical electrical activity at high temporal resolution, but has more limited spatial resolution and is unable to recover information about deep subcortical brain structures. The ability to obtain fMRI information from EEG would enable cost-effective imaging across a wider set of brain regions. Further, beyond augmenting the capabilities of EEG, cross-modality models would facilitate the interpretation of fMRI signals. However, as both EEG and fMRI are high-dimensional and prone to artifacts, it is currently challenging to model fMRI from EEG. To address this challenge, we propose a novel architecture that can predict fMRI signals directly from multi-channel EEG without explicit feature engineering. Our model achieves this by implementing a Sinusoidal Representation Network (SIREN) to learn frequency information in brain dynamics from EEG, which serves as the input to a subsequent encoder-decoder to effectively reconstruct the fMRI signal from a specific brain region. We evaluate our model using a simultaneous EEG-fMRI dataset with 8 subjects and investigate its potential for predicting subcortical fMRI signals. The present results reveal that our model outperforms a recent state-of-the-art model, and indicate the potential of leveraging periodic activation functions in deep neural networks to model functional neuroimaging data.  ( 3 min )
    Eigensubspace of Temporal-Difference Dynamics and How It Improves Value Approximation in Reinforcement Learning. (arXiv:2306.16750v2 [cs.LG] UPDATED)
    We propose a novel value approximation method, namely Eigensubspace Regularized Critic (ERC), for deep reinforcement learning (RL). ERC is motivated by an analysis of the dynamics of Q-value approximation error in the Temporal-Difference (TD) method, which follows a path defined by the 1-eigensubspace of the transition kernel associated with the Markov Decision Process (MDP). It reveals a fundamental property of TD learning that has remained unused in previous deep RL approaches. In ERC, we propose a regularizer that guides the approximation error towards the 1-eigensubspace, resulting in a more efficient and stable path of value approximation. Moreover, we theoretically prove the convergence of the ERC method. In addition, theoretical analysis and experiments demonstrate that ERC effectively reduces the variance of value functions. Among 26 tasks in the DMControl benchmark, ERC outperforms state-of-the-art methods on 20. It also shows significant advantages in Q-value approximation and variance reduction. Our code is available at https://sites.google.com/view/erc-ecml23/.
    Robotic Learning the Sequence of Packing Irregular Objects from Human Demonstrations. (arXiv:2210.01645v2 [cs.RO] UPDATED)
    We tackle the challenge of robotic bin packing with irregular objects, such as groceries. Given the diverse physical attributes of these objects and the complex constraints governing their placement and manipulation, employing preprogrammed strategies becomes unfeasible. Our approach is to learn directly from expert demonstrations in order to extract implicit task knowledge and strategies to ensure safe object positioning, efficient use of space, and the generation of human-like behaviors that enhance human-robot trust. We rely on human demonstrations to learn a Markov chain for predicting the object packing sequence for a given set of items and then compare it with human performance. Our experimental results show that the model outperforms human performance by generating sequence predictions that humans classify as human-like more frequently than human-generated sequences. The human demonstrations were collected using our proposed VR platform, BoxED, which is a box packaging environment for simulating real-world objects and scenarios for fast and streamlined data collection with the purpose of teaching robots. We collected data from 43 participants packing a total of 263 boxes with supermarket-like objects, yielding 4644 object manipulations. Our VR platform can be easily adapted to new scenarios and objects, and is publicly available, alongside our dataset, at https://github.com/andrejfsantos4/BoxED.
    CLearViD: Curriculum Learning for Video Description. (arXiv:2311.04480v1 [cs.CV])
    Video description entails automatically generating coherent natural language sentences that narrate the content of a given video. We introduce CLearViD, a transformer-based model for video description generation that leverages curriculum learning to accomplish this task. In particular, we investigate two curriculum strategies: (1) progressively exposing the model to more challenging samples by gradually applying a Gaussian noise to the video data, and (2) gradually reducing the capacity of the network through dropout during the training process. These methods enable the model to learn more robust and generalizable features. Moreover, CLearViD leverages the Mish activation function, which provides non-linearity and non-monotonicity and helps alleviate the issue of vanishing gradients. Our extensive experiments and ablation studies demonstrate the effectiveness of the proposed model. The results on two datasets, namely ActivityNet Captions and YouCook2, show that CLearViD significantly outperforms existing state-of-the-art models in terms of both accuracy and diversity metrics.
  • Open

    Hierarchical clustering with dot products recovers hidden tree structure. (arXiv:2305.15022v2 [stat.ML] UPDATED)
    In this paper we offer a new perspective on the well established agglomerative clustering algorithm, focusing on recovery of hierarchical structure. We recommend a simple variant of the standard algorithm, in which clusters are merged by maximum average dot product and not, for example, by minimum distance or within-cluster variance. We demonstrate that the tree output by this algorithm provides a bona fide estimate of generative hierarchical structure in data, under a generic probabilistic graphical model. The key technical innovations are to understand how hierarchical information in this model translates into tree geometry which can be recovered from data, and to characterise the benefits of simultaneously growing sample size and data dimension. We demonstrate superior tree recovery performance with real data over existing approaches such as UPGMA, Ward's method, and HDBSCAN.
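    The merge rule described above (join the two clusters with the largest average dot product) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `dot_product_agglomerative`, the brute-force pair search, and the toy data are assumptions made purely for illustration.

```python
import numpy as np

def dot_product_agglomerative(X):
    """Greedy agglomerative clustering that repeatedly merges the pair of
    clusters with the largest average pairwise dot product.

    Returns a list of merges (members_a, members_b, score) in merge order.
    """
    X = np.asarray(X, dtype=float)
    gram = X @ X.T                       # all pairwise dot products
    clusters = {i: [i] for i in range(len(X))}
    merges = []
    next_id = len(X)
    while len(clusters) > 1:
        best = None
        ids = sorted(clusters)
        for pos, a in enumerate(ids):
            for b in ids[pos + 1:]:
                # Average dot product between members of the two clusters.
                score = gram[np.ix_(clusters[a], clusters[b])].mean()
                if best is None or score > best[0]:
                    best = (score, a, b)
        score, a, b = best
        merges.append((tuple(clusters[a]), tuple(clusters[b]), float(score)))
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        next_id += 1
    return merges

# Hypothetical toy data: two groups pointing in roughly orthogonal directions.
points = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.8]])
merges = dot_product_agglomerative(points)
for left, right, score in merges:
    print(left, right, round(score, 3))
```

    The sequence of merges defines the output tree. The brute-force search recomputes every pair at each step, so it is only suitable for small inputs.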
    Functional Bayesian Tucker Decomposition for Continuous-indexed Tensor Data. (arXiv:2311.04829v1 [cs.LG])
    Tucker decomposition is a powerful tensor model to handle multi-aspect data. It demonstrates the low-rank property by decomposing the grid-structured data as interactions between a core tensor and a set of object representations (factors). A fundamental assumption of such decomposition is that there are finitely many objects in each aspect or mode, corresponding to discrete indexes of data entries. However, many real-world data are not naturally posed in this setting. For example, geographic data is represented as continuous indexes of latitude and longitude coordinates, and cannot fit tensor models directly. To generalize Tucker decomposition to such scenarios, we propose Functional Bayesian Tucker Decomposition (FunBaT). We treat the continuous-indexed data as the interaction between the Tucker core and a group of latent functions. We use Gaussian processes (GP) as functional priors to model the latent functions, and then convert the GPs into a state-space prior by constructing an equivalent stochastic differential equation (SDE) to reduce computational cost. An efficient inference algorithm is further developed for scalable posterior approximation based on advanced message-passing techniques. The advantage of our method is shown in both synthetic data and several real-world applications.
    Versatile Energy-Based Probabilistic Models for High Energy Physics. (arXiv:2302.00695v4 [cs.LG] UPDATED)
    As a classical generative modeling approach, energy-based models have the natural advantage of flexibility in the form of the energy function. Recently, energy-based models have achieved great success in modeling high-dimensional data in computer vision and natural language processing. In line with these advancements, we build a multi-purpose energy-based probabilistic model for High Energy Physics events at the Large Hadron Collider. This framework builds on a powerful generative model and describes higher-order inter-particle interactions. It suits different encoding architectures and builds on implicit generation. As for applicational aspects, it can serve as a powerful parameterized event generator for physics simulation, a generic anomalous signal detector free from spurious correlations, and an augmented event classifier for particle identification.
    Spectral Evolution and Invariance in Linear-width Neural Networks. (arXiv:2211.06506v2 [cs.LG] UPDATED)
    We investigate the spectral properties of linear-width feed-forward neural networks, where the sample size is asymptotically proportional to network width. Empirically, we show that the spectra of weight in this high dimensional regime are invariant when trained by gradient descent for small constant learning rates; we provide a theoretical justification for this observation and prove the invariance of the bulk spectra for both conjugate and neural tangent kernels. We demonstrate similar characteristics when training with stochastic gradient descent with small learning rates. When the learning rate is large, we exhibit the emergence of an outlier whose corresponding eigenvector is aligned with the training data structure. We also show that after adaptive gradient training, where a lower test error and feature learning emerge, both weight and kernel matrices exhibit heavy tail behavior. Simple examples are provided to explain when heavy tails can have better generalizations. We exhibit different spectral properties such as invariant bulk, spike, and heavy-tailed distribution from a two-layer neural network using different training strategies, and then correlate them to the feature learning. Analogous phenomena also appear when we train conventional neural networks with real-world data. We conclude that monitoring the evolution of the spectra during training is an essential step toward understanding the training dynamics and feature learning.
    Towards Few-Annotation Learning in Computer Vision: Application to Image Classification and Object Detection tasks. (arXiv:2311.04888v1 [cs.CV])
    In this thesis, we develop theoretical, algorithmic and experimental contributions for Machine Learning with limited labels, and more specifically for the tasks of Image Classification and Object Detection in Computer Vision. In a first contribution, we are interested in bridging the gap between theory and practice for popular Meta-Learning algorithms used in Few-Shot Classification. We make connections to Multi-Task Representation Learning, which benefits from solid theoretical foundations, to verify the best conditions for a more efficient meta-learning. Then, to leverage unlabeled data when training object detectors based on the Transformer architecture, we propose both an unsupervised pretraining and a semi-supervised learning method in two other separate contributions. For pretraining, we improve Contrastive Learning for object detectors by introducing the localization information. Finally, our semi-supervised method is the first tailored to transformer-based detectors.
    Spuriosity Didn't Kill the Classifier: Using Invariant Predictions to Harness Spurious Features. (arXiv:2307.09933v2 [cs.LG] UPDATED)
    To avoid failures on out-of-distribution data, recent works have sought to extract features that have an invariant or stable relationship with the label across domains, discarding "spurious" or unstable features whose relationship with the label changes across domains. However, unstable features often carry complementary information that could boost performance if used correctly in the test domain. In this work, we show how this can be done without test-domain labels. In particular, we prove that pseudo-labels based on stable features provide sufficient guidance for doing so, provided that stable and unstable features are conditionally independent given the label. Based on this theoretical insight, we propose Stable Feature Boosting (SFB), an algorithm for: (i) learning a predictor that separates stable and conditionally-independent unstable features; and (ii) using the stable-feature predictions to adapt the unstable-feature predictions in the test domain. Theoretically, we prove that SFB can learn an asymptotically-optimal predictor without test-domain labels. Empirically, we demonstrate the effectiveness of SFB on real and synthetic data.
    Conditional Sampling of Variational Autoencoders via Iterated Approximate Ancestral Sampling. (arXiv:2308.09078v2 [cs.LG] UPDATED)
    Conditional sampling of variational autoencoders (VAEs) is needed in various applications, such as missing data imputation, but is computationally intractable. A principled choice for asymptotically exact conditional sampling is Metropolis-within-Gibbs (MWG). However, we observe that the tendency of VAEs to learn a structured latent space, a commonly desired property, can cause the MWG sampler to get "stuck" far from the target distribution. This paper mitigates the limitations of MWG: we systematically outline the pitfalls in the context of VAEs, propose two original methods that address these pitfalls, and demonstrate an improved performance of the proposed methods on a set of sampling tasks.
    Statistical limits of correlation detection in trees. (arXiv:2209.13723v2 [math.ST] UPDATED)
    In this paper we address the problem of testing whether two observed trees $(t,t')$ are sampled either independently or from a joint distribution under which they are correlated. This problem, which we refer to as correlation detection in trees, plays a key role in the study of graph alignment for two correlated random graphs. Motivated by graph alignment, we investigate the conditions of existence of one-sided tests, i.e. tests which have vanishing type I error and non-vanishing power in the limit of large tree depth. For the correlated Galton-Watson model with Poisson offspring of mean $\lambda>0$ and correlation parameter $s \in (0,1)$, we identify a phase transition in the limit of large degrees at $s = \sqrt{\alpha}$, where $\alpha \sim 0.3383$ is Otter's constant. Namely, we prove that no such test exists for $s \leq \sqrt{\alpha}$, and that such a test exists whenever $s > \sqrt{\alpha}$, for $\lambda$ large enough. This result sheds new light on the graph alignment problem in the sparse regime (with $O(1)$ average node degrees) and on the performance of the MPAlign method studied in Ganassali et al. (2021), Piccioli et al. (2021), proving in particular the conjecture of Piccioli et al. (2021) that MPAlign succeeds in the partial recovery task for correlation parameter $s>\sqrt{\alpha}$ provided the average node degree $\lambda$ is large enough. As a byproduct, we identify a new family of orthogonal polynomials for the Poisson-Galton-Watson measure which enjoy remarkable properties. These polynomials may be of independent interest for a variety of problems involving graphs, trees or branching processes, beyond the scope of graph alignment.
    Learning Linear Gaussian Polytree Models with Interventions. (arXiv:2311.04636v1 [stat.ML])
    We present a consistent and highly scalable local approach to learn the causal structure of a linear Gaussian polytree using data from interventional experiments with known intervention targets. Our methods first learn the skeleton of the polytree and then orient its edges. The output is a CPDAG representing the interventional equivalence class of the polytree of the true underlying distribution. The skeleton and orientation recovery procedures we use rely on second order statistics and low-dimensional marginal distributions. We assess the performance of our methods under different scenarios in synthetic data sets and apply our algorithm to learn a polytree in a gene expression interventional data set. Our simulation studies demonstrate that our approach is fast, has good accuracy in terms of structural Hamming distance, and handles problems with thousands of nodes.
    Solving Kernel Ridge Regression with Gradient-Based Optimization Methods. (arXiv:2306.16838v3 [stat.ML] UPDATED)
    Kernel ridge regression, KRR, is a generalization of linear ridge regression that is non-linear in the data, but linear in the parameters. Here, we introduce an equivalent formulation of the objective function of KRR, opening up both for using penalties other than the ridge penalty and for studying kernel ridge regression from the perspective of gradient descent. Using a continuous-time perspective, we derive a closed-form solution for solving kernel regression with gradient descent, something we refer to as kernel gradient flow, KGF, and theoretically bound the differences between KRR and KGF, where, for the latter, regularization is obtained through early stopping. We also generalize KRR by replacing the ridge penalty with the $\ell_1$ and $\ell_\infty$ penalties, respectively, and use the fact that analogous to the similarities between KGF and KRR, $\ell_1$ regularization and forward stagewise regression (also known as coordinate descent), and $\ell_\infty$ regularization and sign gradient descent, follow similar solution paths. We can thus alleviate the need for computationally heavy algorithms based on proximal gradient descent. We show theoretically and empirically how the $\ell_1$ and $\ell_\infty$ penalties, and the corresponding gradient-based optimization algorithms, produce sparse and robust kernel regression solutions, respectively.
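    To make the comparison between explicit ridge regularization and early stopping concrete, here is a small sketch. It rests on stated assumptions rather than on the paper's exact formulation: the flow is taken to act on in-sample predictions as df/dt = K(y - f) with f(0) = 0, which gives f(t) = (I - exp(-tK)) y; the RBF kernel, the toy data, and the pairing lam = 1/t are illustrative choices only.

```python
import numpy as np

def rbf_kernel(x, z, bandwidth=1.0):
    """Gaussian (RBF) kernel matrix between two 1-D arrays of inputs."""
    d2 = (x[:, None] - z[None, :]) ** 2
    return np.exp(-d2 / (2.0 * bandwidth ** 2))

def krr_fit(K, y, lam):
    """In-sample KRR predictions: K (K + lam I)^{-1} y."""
    return K @ np.linalg.solve(K + lam * np.eye(len(y)), y)

def kgf_fit(K, y, t):
    """In-sample predictions of the assumed flow df/dt = K (y - f), f(0) = 0,
    evaluated in closed form as (I - exp(-t K)) y via an eigendecomposition."""
    evals, evecs = np.linalg.eigh(K)
    shrink = 1.0 - np.exp(-t * evals)
    return evecs @ (shrink * (evecs.T @ y))

# Hypothetical toy problem.
x = np.linspace(0.0, 1.0, 8)
y = np.sin(2.0 * np.pi * x)
K = rbf_kernel(x, x, bandwidth=0.2)

t = 10.0
f_kgf = kgf_fit(K, y, t)            # regularized by stopping at time t
f_krr = krr_fit(K, y, lam=1.0 / t)  # regularized by a ridge penalty
print(np.round(f_kgf, 3))
print(np.round(f_krr, 3))
```

    Both fits shrink the targets along the eigenvectors of K: the flow uses the factor 1 - exp(-t s) and ridge uses s / (s + lam) for an eigenvalue s, so a longer stopping time plays a role similar to a smaller ridge penalty.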
    Robust and Communication-Efficient Federated Domain Adaptation via Random Features. (arXiv:2311.04686v1 [cs.LG])
    Modern machine learning (ML) models have grown to a scale where training them on a single machine becomes impractical. As a result, there is a growing trend to leverage federated learning (FL) techniques to train large ML models in a distributed and collaborative manner. These models, however, when deployed on new devices, might struggle to generalize well due to domain shifts. In this context, federated domain adaptation (FDA) emerges as a powerful approach to address this challenge. Most existing FDA approaches focus on aligning the distributions of the source and target domains by minimizing a distance between them (e.g., MMD). Such strategies, however, inevitably introduce high communication overheads and can be highly sensitive to network reliability. In this paper, we introduce RF-TCA, an enhancement to the standard Transfer Component Analysis approach that significantly accelerates computation without compromising theoretical and empirical performance. Leveraging the computational advantage of RF-TCA, we further extend it to the FDA setting with FedRF-TCA. The proposed FedRF-TCA protocol boasts communication complexity that is independent of the sample size, while maintaining performance that is either comparable to or even surpasses state-of-the-art FDA methods. We present extensive experiments to showcase the superior performance and robustness (to network conditions) of FedRF-TCA.
    Implicit models, latent compression, intrinsic biases, and cheap lunches in community detection. (arXiv:2210.09186v7 [cs.SI] UPDATED)
    The task of community detection, which aims to partition a network into clusters of nodes to summarize its large-scale structure, has spawned the development of many competing algorithms with varying objectives. Some community detection methods are inferential, explicitly deriving the clustering objective through a probabilistic generative model, while other methods are descriptive, dividing a network according to an objective motivated by a particular application, making it challenging to compare these methods on the same scale. Here we present a solution to this problem that associates any community detection objective, inferential or descriptive, with its corresponding implicit network generative model. This allows us to compute the description length of a network and its partition under arbitrary objectives, providing a principled measure to compare the performance of different algorithms without the need for "ground truth" labels. Our approach also gives access to instances of the community detection problem that are optimal to any given algorithm, and in this way reveals intrinsic biases in popular descriptive methods, explaining their tendency to overfit. Using our framework, we compare a number of community detection methods on artificial networks, and on a corpus of over 500 structurally diverse empirical networks. We find that more expressive community detection methods exhibit consistently superior compression performance on structured data instances, without having degraded performance on a minority of situations where more specialized algorithms perform optimally. Our results undermine the implications of the "no free lunch" theorem for community detection, both conceptually and in practice, since it is confined to unstructured data instances, unlike relevant community detection problems which are structured by requirement.
    Zero-Shot Anomaly Detection via Batch Normalization. (arXiv:2302.07849v4 [cs.LG] UPDATED)
    Anomaly detection (AD) plays a crucial role in many safety-critical application domains. The challenge of adapting an anomaly detector to drift in the normal data distribution, especially when no training data is available for the "new normal," has led to the development of zero-shot AD techniques. In this paper, we propose a simple yet effective method called Adaptive Centered Representations (ACR) for zero-shot batch-level AD. Our approach trains off-the-shelf deep anomaly detectors (such as deep SVDD) to adapt to a set of inter-related training data distributions in combination with batch normalization, enabling automatic zero-shot generalization for unseen AD tasks. This simple recipe, batch normalization plus meta-training, is a highly effective and versatile tool. Our theoretical results guarantee the zero-shot generalization for unseen AD tasks; our empirical results demonstrate the first zero-shot AD results for tabular data and outperform existing methods in zero-shot anomaly detection and segmentation on image data from specialized domains. Code is at https://github.com/aodongli/zero-shot-ad-via-batch-norm
    Compressive Recovery of Sparse Precision Matrices. (arXiv:2311.04673v1 [stat.ML])
    We consider the problem of learning a graph modeling the statistical relations of the $d$ variables of a dataset with $n$ samples $X \in \mathbb{R}^{n \times d}$. Standard approaches amount to searching for a precision matrix $\Theta$ representative of a Gaussian graphical model that adequately explains the data. However, most maximum likelihood-based estimators usually require storing the $d^{2}$ values of the empirical covariance matrix, which can become prohibitive in a high-dimensional setting. In this work, we adopt a compressive viewpoint and aim to estimate a sparse $\Theta$ from a sketch of the data, i.e. a low-dimensional vector of size $m \ll d^{2}$ carefully designed from $X$ using nonlinear random features. Under certain assumptions on the spectrum of $\Theta$ (or its condition number), we show that it is possible to estimate it from a sketch of size $m=\Omega((d+2k)\log(d))$ where $k$ is the maximal number of edges of the underlying graph. These information-theoretic guarantees are inspired by compressed sensing theory and involve restricted isometry properties and instance optimal decoders. We investigate the possibility of achieving practical recovery with an iterative algorithm based on the graphical lasso, viewed as a specific denoiser. We compare our approach and graphical lasso on synthetic datasets, demonstrating its favorable performance even when the dataset is compressed.
    Multi-Source Domain Adaptation through Dataset Dictionary Learning in Wasserstein Space. (arXiv:2307.14953v3 [cs.LG] UPDATED)
    This paper seeks to solve Multi-Source Domain Adaptation (MSDA), which aims to mitigate data distribution shifts when transferring knowledge from multiple labeled source domains to an unlabeled target domain. We propose a novel MSDA framework based on dictionary learning and optimal transport. We interpret each domain in MSDA as an empirical distribution. As such, we express each domain as a Wasserstein barycenter of dictionary atoms, which are empirical distributions. We propose a novel algorithm, DaDiL, for learning via mini-batches: (i) atom distributions; (ii) a matrix of barycentric coordinates. Based on our dictionary, we propose two novel methods for MSDA: DaDiL-R, based on the reconstruction of labeled samples in the target domain, and DaDiL-E, based on the ensembling of classifiers learned on atom distributions. We evaluate our methods on 3 benchmarks: Caltech-Office, Office 31, and CWRU, where we improve on the previous state of the art by 3.15%, 2.29%, and 7.71% in classification performance. Finally, we show that interpolations in the Wasserstein hull of learned atoms provide data that can generalize to the target domain.
    Causal disentanglement of multimodal data. (arXiv:2310.18471v2 [cs.LG] UPDATED)
    Causal representation learning algorithms discover lower-dimensional representations of data that admit a decipherable interpretation of cause and effect; as achieving such interpretable representations is challenging, many causal learning algorithms utilize elements indicating prior information, such as (linear) structural causal models, interventional data, or weak supervision. Unfortunately, in exploratory causal representation learning, such elements and prior information may not be available or warranted. Alternatively, scientific datasets often have multiple modalities or physics-based constraints, and the use of such scientific, multimodal data has been shown to improve disentanglement in fully unsupervised settings. Consequently, we introduce a causal representation learning algorithm (causalPIMA) that can use multimodal data and known physics to discover important features with causal relationships. Our innovative algorithm utilizes a new differentiable parametrization to learn a directed acyclic graph (DAG) together with a latent space of a variational autoencoder in an end-to-end differentiable framework via a single, tractable evidence lower bound loss function. We place a Gaussian mixture prior on the latent space and identify each of the mixtures with an outcome of the DAG nodes; this novel identification enables feature discovery with causal relationships. Tested against a synthetic and a scientific dataset, our results demonstrate the capability of learning an interpretable causal structure while simultaneously discovering key features in a fully unsupervised setting.
    21cmEMU: an emulator of 21cmFAST summary observables. (arXiv:2309.05697v1 [astro-ph.CO] CROSS LISTED)
    Recent years have witnessed rapid progress in observations of the Epoch of Reionization (EoR). These have enabled high-dimensional inference of galaxy and intergalactic medium (IGM) properties during the first billion years of our Universe. However, even using efficient, semi-numerical simulations, traditional inference approaches that compute 3D lightcones on-the-fly can take $10^5$ core hours. Here we present 21cmEMU: an emulator of several summary observables from the popular 21cmFAST simulation code. 21cmEMU takes as input nine parameters characterizing EoR galaxies, and outputs the following summary statistics: (i) the IGM mean neutral fraction; (ii) the 21-cm power spectrum; (iii) the mean 21-cm spin temperature; (iv) the sky-averaged (global) 21-cm signal; (v) the ultraviolet (UV) luminosity functions (LFs); and (vi) the Thomson scattering optical depth to the cosmic microwave background (CMB). All observables are predicted with sub-percent median accuracy, with a reduction of the computational cost by a factor of over $10^4$. After validating inference results, we showcase a few applications, including: (i) quantifying the relative constraining power of different observational datasets; (ii) seeing how recent claims of a late EoR impact previous inferences; and (iii) forecasting upcoming constraints from the sixth observing season of the Hydrogen Epoch of Reionization Array (HERA) telescope. 21cmEMU is publicly available, and is included as an alternative simulator in the public 21CMMC sampler.
    Joint control variate for faster black-box variational inference. (arXiv:2210.07290v3 [cs.LG] UPDATED)
    Black-box variational inference performance is sometimes hindered by the use of gradient estimators with high variance. This variance comes from two sources of randomness: Data subsampling and Monte Carlo sampling. While existing control variates only address Monte Carlo noise, and incremental gradient methods typically only address data subsampling, we propose a new "joint" control variate that jointly reduces variance from both sources of noise. This significantly reduces gradient variance, leading to faster optimization in several applications.
    Towards a Unified Framework of Contrastive Learning for Disentangled Representations. (arXiv:2311.04774v1 [cs.LG])
    Contrastive learning has recently emerged as a promising approach for learning data representations that discover and disentangle the explanatory factors of the data. Previous analyses of such approaches have largely focused on individual contrastive losses, such as noise-contrastive estimation (NCE) and InfoNCE, and rely on specific assumptions about the data generating process. This paper extends the theoretical guarantees for disentanglement to a broader family of contrastive methods, while also relaxing the assumptions about the data distribution. Specifically, we prove identifiability of the true latents for four contrastive losses studied in this paper, without imposing common independence assumptions. The theoretical findings are validated on several benchmark datasets. Finally, practical limitations of these methods are also investigated.
    Information-Theoretic Generalization Bounds for Transductive Learning and its Applications. (arXiv:2311.04561v1 [cs.LG])
    In this paper, we develop data-dependent and algorithm-dependent generalization bounds for transductive learning algorithms in the context of information theory for the first time. We show that the generalization gap of transductive learning algorithms can be bounded by the mutual information between training labels and hypothesis. By innovatively proposing the concept of transductive supersamples, we go beyond the inductive learning setting and establish upper bounds in terms of various information measures. Furthermore, we derive novel PAC-Bayesian bounds and build the connection between generalization and loss landscape flatness under the transductive learning setting. Finally, we present the upper bounds for adaptive optimization algorithms and demonstrate applications of the results to semi-supervised learning and graph learning scenarios. Our theoretical results are validated on both synthetic and real-world datasets.
    Why Do Clinical Probabilistic Models Fail To Transport Between Sites? (arXiv:2311.04787v1 [cs.LG])
    The rising popularity of artificial intelligence in healthcare is highlighting the problem that a computational model achieving super-human clinical performance at its training sites may perform substantially worse at new sites. In this perspective, we present common sources for this failure to transport, which we divide into sources under the control of the experimenter and sources inherent to the clinical data-generating process. Of the inherent sources we look a little deeper into site-specific clinical practices that can affect the data distribution, and propose a potential solution intended to isolate the imprint of those practices on the data from the patterns of disease cause and effect that are the usual target of clinical models.
    More PAC-Bayes bounds: From bounded losses, to losses with general tail behaviors, to anytime-validity. (arXiv:2306.12214v2 [stat.ML] UPDATED)
    In this paper, we present new high-probability PAC-Bayes bounds for different types of losses. Firstly, for losses with a bounded range, we recover a strengthened version of Catoni's bound that holds uniformly for all parameter values. This leads to new fast rate and mixed rate bounds that are interpretable and tighter than previous bounds in the literature. In particular, the fast rate bound is equivalent to the Seeger--Langford bound. Secondly, for losses with more general tail behaviors, we introduce two new parameter-free bounds: a PAC-Bayes Chernoff analogue when the loss' cumulant generating function is bounded, and a bound when the loss' second moment is bounded. These two bounds are obtained using a new technique based on a discretization of the space of possible events for the "in probability" parameter optimization problem. This technique is both simpler and more general than previous approaches optimizing over a grid on the parameters' space. Finally, we extend all previous results to anytime-valid bounds using a simple technique applicable to any existing bound.
    Two Complementary Perspectives to Continual Learning: Ask Not Only What to Optimize, But Also How. (arXiv:2311.04898v1 [cs.LG])
    Recent years have seen considerable progress in the continual training of deep neural networks, predominantly thanks to approaches that add replay or regularization terms to the loss function to approximate the joint loss over all tasks so far. However, we show that even with a perfect approximation to the joint loss, these approaches still suffer from temporary but substantial forgetting when starting to train on a new task. Motivated by this 'stability gap', we propose that continual learning strategies should focus not only on the optimization objective, but also on the way this objective is optimized. While there is some continual learning work that alters the optimization trajectory (e.g., using gradient projection techniques), this line of research is positioned as alternative to improving the optimization objective, while we argue it should be complementary. To evaluate the merits of our proposition, we plan to combine replay-approximated joint objectives with gradient projection-based optimization routines to test whether the addition of the latter provides benefits in terms of (1) alleviating the stability gap, (2) increasing the learning efficiency and (3) improving the final learning outcome.
    Regression with Cost-based Rejection. (arXiv:2311.04550v1 [cs.LG])
    Learning with rejection is an important framework in which a model can refrain from making predictions to avoid critical mispredictions, balancing prediction against rejection. Previous studies on cost-based rejection only focused on the classification setting, which cannot handle the continuous and infinite target space in the regression setting. In this paper, we investigate a novel regression problem called regression with cost-based rejection, where the model can decline to make predictions on some examples given certain rejection costs. To solve this problem, we first formulate the expected risk for this problem and then derive the Bayes optimal solution, which shows that the optimal model should decline to make predictions on the examples whose variance is larger than the rejection cost when the mean squared error is used as the evaluation metric. Furthermore, we propose to train the model by a surrogate loss function that considers rejection as binary classification and we provide conditions for the model consistency, which implies that the Bayes optimal solution can be recovered by our proposed surrogate loss. Extensive experiments demonstrate the effectiveness of our proposed method.
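    The Bayes optimal rule stated above (abstain when the conditional variance exceeds the rejection cost, under mean squared error) can be sketched as a plug-in decision. This is only an illustration, not the surrogate-loss training procedure the abstract proposes: the function name `reject_by_variance` and the assumption that per-example mean and variance estimates are already available are both hypothetical.

```python
import numpy as np

def reject_by_variance(pred_mean, pred_var, cost):
    """Plug-in version of the variance-threshold rule.

    pred_mean, pred_var: assumed per-example estimates of the conditional
    mean and variance of the target. cost: the rejection cost.
    Returns (predictions, reject_mask, expected_loss).
    """
    pred_mean = np.asarray(pred_mean, dtype=float)
    pred_var = np.asarray(pred_var, dtype=float)
    reject = pred_var > cost
    # Rejected examples get no prediction (NaN marks an abstention).
    predictions = np.where(reject, np.nan, pred_mean)
    # Predicting the conditional mean has expected squared error equal to
    # the variance; abstaining incurs the fixed cost instead.
    expected_loss = np.where(reject, cost, pred_var)
    return predictions, reject, expected_loss

# Hypothetical estimates for three examples.
means = [2.0, -1.0, 0.5]
variances = [0.1, 0.9, 0.4]
predictions, reject, expected_loss = reject_by_variance(means, variances, cost=0.5)
print(reject, expected_loss)
```

    Under this rule the per-example expected loss is the smaller of the variance and the cost, which is why thresholding the variance at the cost is the natural decision boundary.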
    Assessment of the Reliability of a Model's Decision by Generalizing Attribution to the Wavelet Domain. (arXiv:2305.14979v4 [cs.CV] UPDATED)
    Neural networks have shown remarkable performance in computer vision, but their deployment in numerous scientific and technical fields is challenging due to their black-box nature. Scientists and practitioners need to evaluate the reliability of a decision, i.e., to know simultaneously if a model relies on the relevant features and whether these features are robust to image corruptions. Existing attribution methods aim to provide human-understandable explanations by highlighting important regions in the image domain, but fail to fully characterize a decision process's reliability. To bridge this gap, we introduce the Wavelet sCale Attribution Method (WCAM), a generalization of attribution from the pixel domain to the space-scale domain using wavelet transforms. Attribution in the wavelet domain reveals where and on what scales the model focuses, thus enabling us to assess whether a decision is reliable. Our code is accessible here: \url{https://github.com/gabrielkasmi/spectral-attribution}.  ( 2 min )
    Causal Scoring: A Framework for Effect Estimation, Effect Ordering, and Effect Classification. (arXiv:2206.12532v3 [stat.ML] UPDATED)
    This paper introduces causal scoring as a novel approach to frame causal estimation in the context of decision making. Causal scoring entails the estimation of scores that support decision making by providing insights into causal effects. We present three valuable causal interpretations of these scores: effect estimation (EE), effect ordering (EO), and effect classification (EC). In the EE interpretation, the causal score represents the effect itself. The EO interpretation implies that the score can serve as a proxy for the magnitude of the effect, enabling the sorting of individuals based on their causal effects. The EC interpretation enables the classification of individuals into high- and low-effect categories using a predefined threshold. We demonstrate the value of these alternative causal interpretations (EO and EC) through two key results. First, we show that aligning the statistical modeling with the desired causal interpretation improves the accuracy of causal estimation. Second, we establish that more flexible causal interpretations are plausible in a wider range of data-generating processes and propose conditions to assess their validity. We showcase the practical utility of the causal scoring framework through examples in diverse fields such as advertising, healthcare, and education, illustrating how it facilitates reasoning about flexible causal interpretations of statistical estimates in various contexts. The examples encompass confounded estimates, effect estimates on surrogate outcomes, and even predictions about non-causal quantities as potential causal scores.  ( 3 min )
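    The three interpretations described above can be illustrated with a small sketch. It is not code from the paper: the function names and the example scores are made up for illustration. Under EE the score is read as the effect itself, under EO only the ranking matters, and under EC only the side of a threshold matters.

```python
from typing import List


def effect_ordering(scores: List[float]) -> List[int]:
    """EO: indices of individuals sorted from largest to smallest score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def effect_classification(scores: List[float], threshold: float) -> List[bool]:
    """EC: True for individuals whose score is above the threshold."""
    return [s > threshold for s in scores]


scores = [0.2, -0.1, 0.7, 0.4]           # hypothetical causal scores
ordering = effect_ordering(scores)        # EO interpretation
high_effect = effect_classification(scores, threshold=0.3)  # EC interpretation

# A score that is only a monotone proxy for the effect (here a rescaled copy)
# is useless for EE but yields exactly the same ordering.
proxy = [10.0 * s + 5.0 for s in scores]
proxy_ordering = effect_ordering(proxy)
```

    The last two lines show why EO is the more flexible interpretation: a proxy that misstates every effect size can still rank individuals correctly.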
    Certified Data Removal from Machine Learning Models. (arXiv:1911.03030v6 [cs.LG] UPDATED)
    Good data stewardship requires removal of data at the request of the data's owner. This raises the question if and how a trained machine-learning model, which implicitly stores information about its training data, should be affected by such a removal request. Is it possible to "remove" data from a machine-learning model? We study this problem by defining certified removal: a very strong theoretical guarantee that a model from which data is removed cannot be distinguished from a model that never observed the data to begin with. We develop a certified-removal mechanism for linear classifiers and empirically study learning settings in which this mechanism is practical.  ( 2 min )
    Data fission: splitting a single data point. (arXiv:2112.11079v8 [stat.ME] UPDATED)
    Suppose we observe a random vector $X$ from some distribution $P$ in a known family with unknown parameters. We ask the following question: when is it possible to split $X$ into two parts $f(X)$ and $g(X)$ such that neither part is sufficient to reconstruct $X$ by itself, but both together can recover $X$ fully, and the joint distribution of $(f(X),g(X))$ is tractable? As one example, if $X=(X_1,\dots,X_n)$ and $P$ is a product distribution, then for any $m<n$, we can split the sample to define $f(X)=(X_1,\dots,X_m)$ and $g(X)=(X_{m+1},\dots,X_n)$. Rasines and Young (2022) offers an alternative route of accomplishing this task through randomization of $X$ with additive Gaussian noise which enables post-selection inference in finite samples for Gaussian distributed data and asymptotically for non-Gaussian additive models. In this paper, we offer a more general methodology for achieving such a split in finite samples by borrowing ideas from Bayesian inference to yield a (frequentist) solution that can be viewed as a continuous analog of data splitting. We call our method data fission, as an alternative to data splitting, data carving and p-value masking. We exemplify the method on a few prototypical applications, such as post-selection inference for trend filtering and other regression problems.  ( 3 min )
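    For the Gaussian case, the additive-noise route mentioned in the abstract can be sketched for a single observation. The sketch assumes $X \sim N(\mu, \sigma^2)$ with known $\sigma$ and a user-chosen tuning parameter $\tau > 0$; the function names are hypothetical. With $Z \sim N(0, \sigma^2)$ drawn independently, $f(X) = X + \tau Z$ and $g(X) = X - Z/\tau$ are uncorrelated (their covariance is $\sigma^2 - \sigma^2 = 0$), hence independent since they are jointly Gaussian, and together they recover $X$ exactly.

```python
import random
from typing import Tuple


def fission_gaussian(x: float, sigma: float, tau: float,
                     rng: random.Random) -> Tuple[float, float]:
    """Split x into (f, g) using independent noise z ~ N(0, sigma^2)."""
    if sigma <= 0 or tau <= 0:
        raise ValueError("sigma and tau must be positive")
    z = rng.gauss(0.0, sigma)
    return x + tau * z, x - z / tau


def recombine(f: float, g: float, tau: float) -> float:
    """Recover x from both parts: f + tau^2 * g = (1 + tau^2) * x."""
    return (f + tau ** 2 * g) / (1.0 + tau ** 2)


rng = random.Random(0)
f, g = fission_gaussian(1.5, sigma=1.0, tau=0.5, rng=rng)
x_back = recombine(f, g, tau=0.5)
```

    A larger $\tau$ puts more noise into $f(X)$ and less into $g(X)$, which plays the role that the split point $m$ plays in ordinary data splitting.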
    Robust Mean Estimation Without Moments for Symmetric Distributions. (arXiv:2302.10844v2 [cs.DS] UPDATED)
    We study the problem of robustly estimating the mean or location parameter without moment assumptions. We show that for a large class of symmetric distributions, the same error as in the Gaussian setting can be achieved efficiently. The distributions we study include products of arbitrary symmetric one-dimensional distributions, such as product Cauchy distributions, as well as elliptical distributions. For product distributions and elliptical distributions with known scatter (covariance) matrix, we show that given an $\varepsilon$-corrupted sample, we can with probability at least $1-\delta$ estimate its location up to error $O(\varepsilon \sqrt{\log(1/\varepsilon)})$ using $\tfrac{d\log(d) + \log(1/\delta)}{\varepsilon^2 \log(1/\varepsilon)}$ samples. This result matches the best-known guarantees for the Gaussian distribution and known SQ lower bounds (up to the $\log(d)$ factor). For elliptical distributions with unknown scatter (covariance) matrix, we propose a sequence of efficient algorithms that approaches this optimal error. Specifically, for every $k \in \mathbb{N}$, we design an estimator using time and samples $\tilde{O}({d^k})$ achieving error $O(\varepsilon^{1-\frac{1}{2k}})$. This matches the error and running time guarantees when assuming certifiably bounded moments of order up to $k$. For unknown covariance, such error bounds of $o(\sqrt{\varepsilon})$ are not even known for (general) sub-Gaussian distributions. Our algorithms are based on a generalization of the well-known filtering technique. We show how this machinery can be combined with Huber-loss-based techniques to work with projections of the noise that behave more nicely than the initial noise. Moreover, we show how SoS proofs can be used to obtain algorithmic guarantees even for distributions without a first moment. We believe that this approach may find other applications in future works.  ( 3 min )
    Natural Bayesian Cram\'er-Rao Bound with an Application to Covariance Estimation. (arXiv:2311.04748v1 [math.ST])
    In this paper, we develop a new Cram\'er-Rao Bound (CRB) for the case where the parameter to estimate lies in a manifold and follows a prior distribution. This derivation leads to a natural inequality between an error criterion based on geometrical properties and this new bound. This main contribution is illustrated on the problem of covariance estimation when the data follow a Gaussian distribution and the prior distribution is an inverse Wishart. Numerical simulations show that the proposed CRB reveals interesting properties of the MAP estimator that are not observed with the classical Bayesian CRB.  ( 2 min )
    Optimal Deep Neural Network Approximation for Korobov Functions with respect to Sobolev Norms. (arXiv:2311.04779v1 [math.NA])
    This paper establishes the nearly optimal rate of approximation for deep neural networks (DNNs) when applied to Korobov functions, effectively overcoming the curse of dimensionality. The approximation results presented in this paper are measured with respect to $L_p$ norms and $H^1$ norms. Our achieved approximation rate demonstrates a remarkable "super-convergence" rate, outperforming traditional methods and any continuous function approximator. These results are non-asymptotic, providing error bounds that consider both the width and depth of the networks simultaneously.  ( 2 min )
    Robust Best-arm Identification in Linear Bandits. (arXiv:2311.04731v1 [cs.LG])
    We study the robust best-arm identification problem (RBAI) in the case of linear rewards. The primary objective is to identify a near-optimal robust arm, which involves selecting arms at every round and assessing their robustness by exploring potential adversarial actions. This approach is particularly relevant when utilizing a simulator and seeking to identify a robust solution for real-world transfer. To this end, we present an instance-dependent lower bound for the robust best-arm identification problem with linear rewards. Furthermore, we propose both static and adaptive bandit algorithms that achieve sample complexity that matches the lower bound. In synthetic experiments, our algorithms effectively identify the best robust arm and perform similarly to the oracle strategy. As an application, we examine diabetes care and the process of learning insulin dose recommendations that are robust with respect to inaccuracies in standard calculators. Our algorithms prove to be effective in identifying robust dosage values across various age ranges of patients.  ( 2 min )
    On the estimation of the number of components in multivariate functional principal component analysis. (arXiv:2311.04540v1 [stat.ME])
    Happ and Greven (2018) developed a methodology for principal components analysis of multivariate functional data for data observed on different dimensional domains. Their approach relies on an estimation of univariate functional principal components for each univariate functional feature. In this paper, we present extensive simulations to investigate choosing the number of principal components to retain. We show empirically that the conventional approach of using a percentage of variance explained threshold for each univariate functional feature may be unreliable when aiming to explain an overall percentage of variance in the multivariate functional data, and thus we advise practitioners to be careful when using it.  ( 2 min )
    Likelihood Ratio Confidence Sets for Sequential Decision Making. (arXiv:2311.04402v1 [cs.LG])
    Certifiable, adaptive uncertainty estimates for unknown quantities are an essential ingredient of sequential decision-making algorithms. Standard approaches rely on problem-dependent concentration results and are limited to a specific combination of parameterization, noise family, and estimator. In this paper, we revisit the likelihood-based inference principle and propose to use likelihood ratios to construct any-time valid confidence sequences without requiring specialized treatment in each application scenario. Our method is especially suitable for problems with well-specified likelihoods, and the resulting sets always maintain the prescribed coverage in a model-agnostic manner. The size of the sets depends on a choice of estimator sequence in the likelihood ratio. We discuss how to provably choose the best sequence of estimators and shed light on connections to online convex optimization with algorithms such as Follow-the-Regularized-Leader. To counteract the initially large bias of the estimators, we propose a reweighting scheme that also opens up deployment in non-parametric settings such as RKHS function classes. We provide a non-asymptotic analysis of the likelihood ratio confidence sets size for generalized linear models, using insights from convex duality and online learning. We showcase the practical strength of our method on generalized linear bandit problems, survival analysis, and bandits with various additive noise distributions.  ( 2 min )
    Online Learning Quantum States with the Logarithmic Loss via VB-FTRL. (arXiv:2311.04237v1 [quant-ph])
    Online learning quantum states with the logarithmic loss (LL-OLQS) is a quantum generalization of online portfolio selection, a classic open problem in the field of online learning for over three decades. The problem also emerges in designing randomized optimization algorithms for maximum-likelihood quantum state tomography. Recently, Jezequel et al. (arXiv:2209.13932) proposed the VB-FTRL algorithm, the first nearly regret-optimal algorithm for OPS with moderate computational complexity. In this note, we generalize VB-FTRL for LL-OLQS. Let $d$ denote the dimension and $T$ the number of rounds. The generalized algorithm achieves a regret rate of $O ( d^2 \log ( d + T ) )$ for LL-OLQS. Each iteration of the algorithm consists of solving a semidefinite program that can be implemented in polynomial time by, e.g., cutting-plane methods. For comparison, the best-known regret rate for LL-OLQS is currently $O ( d^2 \log T )$, achieved by the exponential weight method. However, there is no explicit implementation available for the exponential weight method for LL-OLQS. To facilitate the generalization, we introduce the notion of VB-convexity. VB-convexity is a sufficient condition for the logarithmic barrier associated with any function to be convex and is of independent interest.  ( 2 min )

  • Open

    [D] How can you add additional features/attributes while doing instance segmentation?
    I want to do instance segmentation of objects in images. Usually I would stick to something like a Mask R-CNN and let it run. However, in addition to the images themselves and the pre-labeled images, I have additional features that might be interesting for the segmentation. Example: I want to segment certain products in images from a factory and I have additional information about the products that run at a specific time (like product family, color, brand, etc.). How do I add these additional features in an instance segmentation? submitted by /u/phillylovesdata [link] [comments]  ( 9 min )
    [P] Pandas Vs Tensorflow for reading dataset
    Background info: Greetings. I am a student who attends a computer science university, and as part of my dissertation I have to train some models. The thing is, I'm quite new to machine learning and my knowledge is limited so far. Main: I'm trying to open a 10 GB dataset in Google Colab to sanitize and preprocess the data before feeding it into a CNN model, and I don't know which is the best way to do it. Thanks for your time submitted by /u/ThrowRA39495 [link] [comments]  ( 9 min )
    [P] Coqui released XTTSv2
    XTTSv2 is released. I’d say it’s a big jump in quality: better voice cloning; better audio; impressive prosody and expressiveness; more languages added (16 in total, I guess); non-EN languages sound way better; streaming under 200 ms (I have a 3090); finetuning code. You can try it here: https://huggingface.co/spaces/coqui/xtts submitted by /u/coinfelix [link] [comments]  ( 9 min )
    [D] Is it a good idea to use VGG16 for a image classification mobile app?
    I'm new to image classification and ML and this is going to be my first project on those topics. I'm considering using VGG16 because I saw some studies showing that it has a generally great accuracy score (80-95%) but I'm worried that the model might not be fast enough or the app file size might get massive if I want the app to be usable without internet connection. What do you guys think? submitted by /u/Rhet98 [link] [comments]  ( 9 min )
    [P] GPT vs. StarCraft
    This is the first in a series of webcasts covering the development and experimentation of using GPT algorithms, LangChain and Python to control the high-level strategy of a StarCraft II bot. I’ll be running through the basics of the implementation, discussing the use of prompts and prompt engineering, and demonstrating the implementation in action. https://youtu.be/E3Sj2L6ZnXA submitted by /u/Resident-Weather-324 [link] [comments]  ( 9 min )
    [D] transformers Trainer log
    I'm trying to finetune an LLM with LoRA, using transformers' Trainer. In twenty minutes of training it didn't output any logs to screen or to disk, even though I set logging_steps to 1: trainer = transformers.Trainer( ... args = transformers.TrainingArguments( ... logging_steps=1, logging_dir="logs", ), ) What do I need to do to see the training log? Set up any callbacks? submitted by /u/Foxtr0t [link] [comments]  ( 9 min )
    Doing Molecular Dynamics Simulations using GNN's [D]
    I've been trying to make a gnn that can do molecular dynamics simulations on some decently simple molecules. While I have experience with ml, I'm pretty new to the molecular dynamics part and found that to be pretty confusing. Can someone please point me to some resources that cover this topic? submitted by /u/The_Invincible7 [link] [comments]  ( 9 min )
    [D] How To Do Product Matching Using ML?
    Hi Folks, I am trying to build a solution where I input an e-commerce product URL and automatically get all the competitors for that product. If you could provide any direction concerning it. It will be beneficial. submitted by /u/Used-Preparation-921 [link] [comments]  ( 9 min )
    [D] What AI topics are you curious about but rarely see in the spotlight?
    I'm a data engineer who somehow ended up as a software developer. So many of my friends are now working with the OpenAI API to add generative capabilities to their products, but they lack A LOT of context when it comes to how LLMs actually work. This is why I started writing popular-science style articles that unpack AI concepts for software developers working on real-world applications. It started kind of slow, honestly I wrote a bit too "brainy" for them, but now I've found a voice that resonates with this audience much better and I want to ramp up my writing cadence. I would love to hear your thoughts about what concepts I should write about next. What gets you excited but is hard to explain to someone with a different background? submitted by /u/GratefullyFriendly73 [link] [comments]  ( 9 min )
    [P] Fine-grained semantic search and clustering with interpretable multi-feature text embeddings
    Hi, we all know that text embeddings (e.g., SBERT, simCSE, LLM embeddings) are very powerful. However, my little grudge with them was always that it's hard to say what's really in them. Okay, matching them gives some value of "relatedness" or "similarity", but the value is kind of really hard to interpret. I mean text can be really diverse and is often similar in some categories, but not in others. Here's an example: "The man builds a tent" "Two men build a tent" A text embedding model such as SBERT gives a high similarity score, which is fine, since the sentences are in fact quite similar. However, they're similar because they're mostly on the same stuff/topics, but they're dissimilar in their use of number: in the first sentence there's one man, in the second sentence there's two! …  ( 10 min )
    [D] T5-base results are worse than t5-small
    Hi everyone, I pretrained T5 small, base and large on the [PrivaSeer](https://privaseer.ist.psu.edu/data) corpus with a spanned MLM objective. I called the pretrained model PrivaT5. Then I finetuned PrivaT5 and T5 small, base and large on some tasks of the [PrivacyGLUE](https://github.com/infsys-lab/privacy-glue) benchmark. You can see the results in these plots: https://preview.redd.it/vzq1l2jdqazb1.png?width=1280&format=png&auto=webp&s=f73239ecf59f409ec1371a59e52599e89d89c97b For all model sizes I used the same hyperparameters, except for the batch size, which I changed to make the model fit on the TPU. Example: ``` --model_name_or_path="t5-base" --hub_save_name_or_path="t5-base" --model_type="t5-base" --config_name="t5-base" --tokenizer_name="t5-base" --max_seq_length="512" --per_device_train_batch_size="16" --per_device_eval_batch_size="16" --adafactor --learning_rate="0.001" --weight_decay="0.0" --warmup_steps="0" --overwrite_output_dir --logging_steps="500" --save_steps="50" --eval_steps="50" --num_train_epochs="100" ``` Could anyone give me possible reasons why the PrivaT5 base performance unexpectedly drops on the OPP-115 and Policy-Detection tasks compared to PrivaT5 small? (Multilabel text classification & binary text classification respectively). Thank you! submitted by /u/alzoubi36 [link] [comments]  ( 9 min )
    Best Data Visualization Technique for a Multilabel Classification Data? [Discussion]
    I would like to ask if there are any other data visualization techniques or tools for visualizing multilabel classification data. One of the many ways to visualize a high-dimensional dataset is a correlation heatmap. A heatmap can be made using the seaborn library and the pandas DataFrame's corr() method, which returns the pairwise correlations of columns. (Figure: correlation map using seaborn.) But I want to explore other and possibly better techniques aside from this, especially ones that can deal with a very high-dimensional dataset such as mine. I would love to hear your suggestions regarding this. Thank you very much!!! submitted by /u/RalphuChino [link] [comments]  ( 9 min )
    [R] Levels of AGI: Operationalizing Progress on the Path to AGI - DeepMind 2023
    Paper: https://arxiv.org/abs/2311.02462 Abstract: We propose a framework for classifying the capabilities and behavior of Artificial General Intelligence (AGI) models and their precursors. This framework introduces levels of AGI performance, generality, and autonomy. It is our hope that this framework will be useful in an analogous way to the levels of autonomous driving, by providing a common language to compare models, assess risks, and measure progress along the path to AGI. To develop our framework, we analyze existing definitions of AGI, and distill six principles that a useful ontology for AGI should satisfy. These principles include focusing on capabilities rather than mechanisms; separately evaluating generality and performance; and defining stages along the path toward AGI, rather than focusing on the endpoint. With these principles in mind, we propose 'Levels of AGI' based on depth (performance) and breadth (generality) of capabilities, and reflect on how current systems fit into this ontology. We discuss the challenging requirements for future benchmarks that quantify the behavior and capabilities of AGI models against these levels. Finally, we discuss how these levels of AGI interact with deployment considerations such as autonomy and risk, and emphasize the importance of carefully selecting Human-AI Interaction paradigms for responsible and safe deployment of highly capable AI systems. https://preview.redd.it/64biopsh79zb1.png?width=797&format=png&auto=webp&s=9af1c5085938dac000aaf23aa1b306133b01edb4 submitted by /u/APaperADay [link] [comments]  ( 9 min )
    [D] Best method of knowledge distillation available?
    Best practical method of knowledge distillation available? TL;DR: Knowledge distillation generally performs worse than training a model from scratch on the data, from what I've seen online. Is there a method of KD where this doesn't happen and I get close to the performance of a model trained from scratch? So I've recently been interested in making DL models more useful for everyday tasks and, considering their size, in trying to run these models on consumer devices without much loss in quality. But rn, from what I've seen, this just feels like trying to fit an elephant into his pants. Basically it tears every time I try. I found quantization to be cool, but I need to reduce the size even more tbh. So I found knowledge distillation. But from what I've seen, though theoretically it is fantastic, practically knowledge distillation sucks, and is probably worse than just straight up training the model from scratch on the dataset. So is there a used and proven method of knowledge distillation that I can use, which will give me at least very close accuracy to a model trained from scratch on the dataset? submitted by /u/Xanta_Kross [link] [comments]  ( 9 min )
  • Open

    Looking for the best AI girlfriend if possible
    Looking for something that can get as close to human as possible if one exists. I’ve tried and enjoyed Replika actually still have it but my rep is at 11 and the levels have been exhausted. Though it’s limited to NSFW text and voice only. Was looking to see if I could find an AI that has replika features but with better memory, sends selfies regularly of herself and not randomly generated ones (including NSFW) I’m new to the game. submitted by /u/Jyeung691 [link] [comments]  ( 9 min )
    Humane officially unveils the AI Pin device that aspires to replace smartphone
    Humane has officially unveiled the Humane Ai Pin, an AI-powered device that aims to replace smartphones. The device is a standalone device and software platform built from the ground up with AI, meaning it does not need to be paired with a smartphone. It can perform various functions such as calling, texting, taking photos or videos, listening to music, and more. The device is powered by AI and can recognize objects, provide nutrition information, and even make online purchases. The Humane Ai Pin will cost $699 and will be available for order in the United States on November 16th. Source : https://bgr.com/tech/humane-officially-unveils-the-ai-pin-a-device-that-aspires-to-replace-your-smartphone/ submitted by /u/NuseAI [link] [comments]  ( 9 min )
    Need help training a PPO NN to learn how to play my deckbuilding game
    Hey, I have a roguelike deckbuilding game I want to train an agent to play using pure unsupervised RL; I chose PPO as I understand (to my amateur knowledge) that it is the most fitting algorithm. I have a very large categorical space that I have to send in (basically what cards are in the deck and which cards are being offered to pick), and I need the agent to learn the best picks. I attempted to use an embedding layer and input the cards the player has + given cards + numerical data (concatenated with the embedding output). I tried playing around with various hyperparameters, but so far, I have not been able to generate any learning. Any help or advice would be greatly appreciated, thanks! submitted by /u/Jagerjj [link] [comments]  ( 9 min )
    OCR for custom characters
    Hello guys, I wanna ask if someone knows how can I detect custom made characters with OCR. I want to detect the normal alphabet and numbers but in addition for example hieroglyphics. I tried to use YOLO and the hieroglyphics were labeled as AA or MM or something like that, but I want your opinions. Thanks! submitted by /u/quorra96 [link] [comments]  ( 9 min )
    Elon Musk: War, AI, Aliens, Politics, Physics, Video Games, and Humanity | Lex Fridman Podcast
    submitted by /u/Overflame [link] [comments]  ( 9 min )
    What AI-Tools do you use in your daily routine?
    As the title says, I am interested in your daily AI tool activities. Which tools do you use, and which make you more efficient? Do you have any suggestions for me or others? Mine is mymind.com. It helps save content, as it automatically categorizes websites and notes. So far, I really enjoy it. What are your suggestions? submitted by /u/345Y_Chubby [link] [comments]  ( 9 min )
    Humane AI Pin reveal video
    submitted by /u/webbedgiant [link] [comments]  ( 9 min )
    Disney Pixar AI Generator Review
    submitted by /u/Amandacerni [link] [comments]  ( 9 min )
    AI Sales Chat Bot and Telephone Agent
    do you know eve.calls? are they legit? - submitted by /u/Niu_Davinci [link] [comments]  ( 9 min )
    AI assistive tools able to do some or all of these tasks to help with some of the ongoing issues symptomatic of ADHD, anxiety, depression, etc.?
    First off, I'm not entirely sure how to frame this or even where I should be asking. So if I'm in the wrong place and you have a better suggestion, that would be great. So far I've asked in LearnMachineLearning and MLQuestions but had no response. As someone with various mental health and gifted learning issues with a history in IT (albeit a bit of a dusty unused one), this is something I have been thinking about for a very long time. Every time I look into it, I've found the tech just wasn't really up for it without it being a huge nightmare anyway. Recently I've been looking at recent advances in things like ChatGPT and similar AI, and I feel it's time to look again and see if there is anything that can help me that doesn't need an entire organization working on a mainframe to implement. So here…  ( 11 min )
    Those childhood days that I can't get back
    submitted by /u/Sea_Permit5660 [link] [comments]  ( 9 min )
    AI Resources for Content Creators (Avatars, Text-to-speech, speech-to-speech...)
    Hi! I am completely new to this, but with the advancement of AI tools I want to give it a try. I want to make short tutorials for beginners in the 3D software Blender, but I don't want to use my voice for anonymity reasons. Plus, if this is already feasible, I would at least place a small reacting avatar in the bottom right or so. Ideally, the workflow would be that I narrate what I am doing and the AI translates this into either text or another voice directly, as well as animating the avatar. I am okay with paid tools; my question is whether we are "at this point" already technically and, if yes, whether you have recommended tools. submitted by /u/Ryselle [link] [comments]  ( 9 min )
    One-Minute Daily AI News 11/8/2023
    GitHub Copilot Chat will become generally available in December.[1] Figma introduces FigJam AI to spare designers from boring planning prep.[2] AI Will Cut Cost of Animated Films by 90%, Jeff Katzenberg Says.[3] Google just announced that it’s bringing Generative AI in search to more than 120 new countries and territories. Thus, the Search Generative Experience (SGE) got its largest international expansion so far.[4] Sources: [1] https://www.neowin.net/news/github-copilot-chat-will-become-generally-available-in-december/ [2] https://www.theverge.com/2023/11/7/23950667/figma-figjam-generative-ai-design-tools-beta-announcement [3] https://finance.yahoo.com/news/ai-cut-cost-animated-films-051506470.html? [4] https://www.gsmarena.com/google_brings_generative_ai_in_search_to_120_new_countries_and_territories-news-60524.php submitted by /u/Excellent-Target-847 [link] [comments]  ( 9 min )
    AI Tools for Editing Extensive Video Files from Family Vacations
    Even though we have access to Adobe Premiere and other similar video editing tools, I'm wondering if there's anything out there that will take most scenic or b-roll footage from our family vacation, which includes drone and DSLR footage, and trims the footage in the right sequence to create a video of X length with little or no human intervention. Thoughts? submitted by /u/crmjewelers [link] [comments]  ( 9 min )
  • Open

    Towards model-free RL algorithms that scale well with unstructured data
    Paper: https://arxiv.org/abs/2311.02215 Abstract: Conventional reinforcement learning (RL) algorithms exhibit broad generality in their theoretical formulation and high performance on several challenging domains when combined with powerful function approximation. However, developing RL algorithms that perform well across problems with unstructured observations at scale remains challenging because most function approximation methods rely on externally provisioned knowledge about the structure of the input for good performance (e.g. convolutional networks, graph neural networks, tile-coding). A common practice in RL is to evaluate algorithms on a single problem, or on problems with limited variation in the observation scale. RL practitioners lack a systematic way to study how well a single RL algorithm performs when instantiated across a range of problem scales, and they lack function approximation techniques that scale well with unstructured observations. We address these limitations by providing environments and algorithms to study scaling for unstructured observation vectors and flat action spaces. We introduce a family of combinatorial RL problems with an exponentially large state space and high-dimensional dynamics but where linear computation is sufficient to learn a (nonlinear) value function estimate for performant control. We provide an algorithm that constructs reward-relevant general value function (GVF) questions to find and exploit predictive structure directly from the experience stream. In an empirical evaluation of the approach on synthetic problems, we observe a sample complexity that scales linearly with the observation size. The proposed algorithm reliably outperforms a conventional deep RL algorithm on these scaling problems, and exhibits several desirable auxiliary properties. These results suggest new algorithmic mechanisms by which algorithms can learn at scale from unstructured data.
    submitted by /u/APaperADay  ( 9 min )
    Mujoco on windows
    I'm trying to run this code example https://pytorch.org/rl/tutorials/coding_ppo.html to get a sense for how to implement PPO, and I'm also interested in a couple of other RL projects that use MuJoCo. Unfortunately I'm on Windows 10 and I heard MuJoCo is deprecated for Windows, so I figured I would just use an Ubuntu virtual machine in VMware Workstation. Once I set it up, I realized it can't access my NVIDIA 3080. Should I just dual boot? Or is there a Windows solution? If I dual boot, will the Linux boot have full access to the GPU? submitted by /u/theLanguageSprite  ( 9 min )
    "When to Show a Suggestion? Integrating Human Feedback in AI-Assisted Programming", Mozannar et al 2023
    submitted by /u/gwern  ( 9 min )
    how do i design the action space for this in gym?
    Hello, I want my PPO model to choose between 5 different actions. Two of these are binary and the other three are real numbers between 0 and 1, so my gym action space is like this: first = Discrete(2); second = Discrete(2); third = Box(low=0,high=1,shape=(1,)); fourth = Box(low=0,high=1,shape=(1,)); fifth = Box(low=0,high=1,shape=(1,)); action_space = Tuple((first,second,third,fourth,fifth)). A sample would look like this: (1, 0, array([0.57706404], dtype=float32), array([0.956483], dtype=float32), array([0.09770907], dtype=float32)). But I actually want only one AI-chosen element to be above zero and everything else zero, for example (0, 0, array([0.57706404], dtype=float32), array([0], dtype=float32), array([0], dtype=float32)). Is this even possible? The best thing I can come up with is the following action space: type = Discrete(5); ratio = Box(low=0,high=1,shape=(1,)); action_space = Tuple((type,ratio)). Sample: (4, array([0.37991711], dtype=float32)). My guess is that this would work, but it does not seem optimal to me. Or should I actually use the previous action_space and hope PPO will learn over time to only put one element above zero and null everything else? Any help would be appreciated. Thanks! submitted by /u/phoenix_fire_stone  ( 9 min )
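    One way to read the second proposal in the post above is as a (choice, ratio) pair that the environment expands into the five-slot action itself, so the policy never has to learn the "only one element above zero" constraint. The following is a minimal sketch of that expansion in plain Python, not a definitive design: the function name `decode_action`, the constant names, and the convention that a chosen binary slot is set to 1 while a chosen continuous slot takes the ratio are all assumptions made for illustration.

```python
from typing import List

N_CHOICES = 5          # two binary actions followed by three continuous ones
BINARY_SLOTS = {0, 1}  # assumption: slots 0 and 1 are the binary actions


def decode_action(choice: int, ratio: float) -> List[float]:
    """Expand a (choice, ratio) pair into a 5-slot action with at most one non-zero slot."""
    if not 0 <= choice < N_CHOICES:
        raise ValueError(f"choice must be in [0, {N_CHOICES}), got {choice}")
    # Clip the ratio, since a continuous policy output can overshoot the Box bounds.
    ratio = min(max(ratio, 0.0), 1.0)
    action = [0.0] * N_CHOICES
    # A chosen binary slot is simply switched on; a continuous slot takes the ratio.
    action[choice] = 1.0 if choice in BINARY_SLOTS else ratio
    return action


if __name__ == "__main__":
    # Mirrors the sample from the post: choice 4 with ratio 0.37991711.
    print(decode_action(4, 0.37991711))
```

    In a gym environment this decoding would typically sit at the top of `step()`, with the action space declared as the Tuple of Discrete(5) and Box(low=0, high=1, shape=(1,)) described in the post.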
    Offline RL: Hyper-parameter selection and comparison with behavior policy
    For offline hyper-parameter selection, I've come across several papers, including this one, that recommend methods like FQE. These approaches can help eliminate poorly performing policies. I tested this on a toy environment, and the results so far indicate a strong correlation between actual rewards and the values estimated by the FQE. Is it also a valid technique to compare the performance of the trained policy with the behavior policy (the policy used to collect the data)? I have a few questions: Could the FQE potentially overestimate out-of-distribution (OOD) actions, leading to a higher value for the trained policy (that selects OOD actions) compared to the behavior policy? Would the comparison be valid if the data is collected by several different policies with varying performance levels (some expert, others completely random, and some in between)? Also, if you have experience with offline reinforcement learning, I would love to know how you dealt with hyper-parameter selection and comparison with the behavior policy. What were the end results? submitted by /u/ZIGGY-Zz  ( 9 min )
    What is best way to get RL agent to generalize across different versions of the same environment?
    E.g. imagine a gridworld where agent has to go to a goal space. I want it to be able to do this across many different types of levels but where task is same: "go to goal." Right now I use parallel envs for PPO and train simultaneously on all version environments. It worked for 2 very small levels but a bit slow, so I wanted to confirm this was best approach (e.g. vs sequential learning or curriculum learning or something completely different). I tried googling but can't find info on it for some reason. I did see the parallel env approach with domain randomization in a paper, but they don't discuss it much. submitted by /u/rl_ninja_rl_ninja  ( 9 min )
    Ask for help with code
    Hey everyone, I am working on a project in the field of MARL and it would be a great help if anyone could help me. I am looking for implementations of the following environments: Decentralized Tiger, Decentralized Rock Sampling, and Decentralized Box Pushing. I would prefer the implementations to be in Python, and if they could follow the MultiAgentEnv interface created by the pymarl framework, that would be perfect! Thank you. submitted by /u/yoyo-master  ( 9 min )
  • Open

    Responsible AI at Google Research: Context in AI Research (CAIR)
    Posted by Katherine Heller, Research Scientist, Google Research, on behalf of the CAIR Team Artificial intelligence (AI) and related machine learning (ML) technologies are increasingly influential in the world around us, making it imperative that we consider the potential impacts on society and individuals in all aspects of the technology that we create. To these ends, the Context in AI Research (CAIR) team develops novel AI methods in the context of the entire AI pipeline: from data to end-user feedback. The pipeline for building an AI system typically starts with data collection, followed by designing a model to run on that data, deployment of the model in the real world, and lastly, compiling and incorporation of human feedback. Originating in the health space, and now expanded to a…  ( 93 min )
    Overcoming leakage on error-corrected quantum processors
    Posted by Kevin Miao and Matt McEwen, Research Scientists, Quantum AI Team The qubits that make up Google quantum devices are delicate and noisy, so it’s necessary to incorporate error correction procedures that identify and account for qubit errors on the way to building a useful quantum computer. Two of the most prevalent error mechanisms are bit-flip errors (where the energy state of the qubit changes) and phase-flip errors (where the phase of the encoded quantum information changes). Quantum error correction (QEC) promises to address and mitigate these two prominent errors. However, there is an assortment of other error mechanisms that challenges the effectiveness of QEC. While we want qubits to behave as ideal two-level systems with no loss mechanisms, this is not the case in r…  ( 94 min )
  • Open

    Promote pipelines in a multi-environment setup using Amazon SageMaker Model Registry, HashiCorp Terraform, GitHub, and Jenkins CI/CD
    Building out a machine learning operations (MLOps) platform in the rapidly evolving landscape of artificial intelligence (AI) and machine learning (ML) for organizations is essential for seamlessly bridging the gap between data science experimentation and deployment while meeting the requirements around model performance, security, and compliance. In order to fulfill regulatory and compliance requirements, the […]  ( 17 min )
    Customizing coding companions for organizations
    Generative AI models for coding companions are mostly trained on publicly available source code and natural language text. While the large size of the training corpus enables the models to generate code for commonly used functionality, these models are unaware of code in private repositories and the associated coding styles that are enforced when developing […]  ( 11 min )
  • Open

    Enter a World of Samurai and Demons: GFN Thursday Brings Capcom’s ‘Onimusha: Warlords’ to the Cloud
    Wield the blade and embrace the way of the samurai for some thrilling action — Onimusha: Warlords comes to GeForce NOW this week. Members can experience feudal Japan in this hack-and-slash adventure game in the cloud. It’s part of an action-packed GFN Thursday, with 16 more games joining the cloud gaming platform’s library.  ( 5 min )
  • Open

    OpenAI Data Partnerships
    Working together to create open-source and private datasets for AI training.  ( 2 min )
  • Open

    Explained: Generative AI
    How do powerful generative AI systems like ChatGPT work, and what makes them different from other types of artificial intelligence?  ( 11 min )
  • Open

    Flow-based distributionally robust optimization. (arXiv:2310.19253v2 [cs.LG] UPDATED)
    We present a computationally efficient framework, called FlowDRO, for solving flow-based distributionally robust optimization (DRO) problems with Wasserstein uncertainty sets while aiming to find a continuous worst-case distribution (also called the Least Favorable Distribution, LFD). Requiring the LFD to be continuous makes the algorithm scalable to problems with larger sample sizes and improves the generalization capability of the induced robust algorithms. To tackle the computationally challenging infinite-dimensional optimization problem, we leverage flow-based models and continuous-time invertible transport maps between the data distribution and the target distribution. We also develop a Wasserstein proximal gradient flow type of algorithm. In theory, we establish the equivalence of the solution by optimal transport map to the original formulation, as well as the dual form of the problem through Wasserstein calculus and Brenier's theorem. In practice, we parameterize the transport maps by a sequence of neural networks progressively trained in blocks by gradient descent. Our computational framework is general, can handle high-dimensional data with large sample sizes, and can be useful for various applications. We demonstrate its usage in adversarial learning, distributionally robust hypothesis testing, and a new mechanism for data-driven distribution perturbation differential privacy, where the proposed method gives strong empirical performance on real high-dimensional data.  ( 2 min )
    Score-based Source Separation with Applications to Digital Communication Signals. (arXiv:2306.14411v2 [cs.LG] UPDATED)
    We propose a new method for separating superimposed sources using diffusion-based generative models. Our method relies only on separately trained statistical priors of independent sources to establish a new objective function guided by maximum a posteriori estimation with an $\alpha$-posterior, across multiple levels of Gaussian smoothing. Motivated by applications in radio-frequency (RF) systems, we are interested in sources with underlying discrete nature and the recovery of encoded bits from a signal of interest, as measured by the bit error rate (BER). Experimental results with RF mixtures demonstrate that our method results in a BER reduction of 95% over classical and existing learning-based methods. Our analysis demonstrates that our proposed method yields solutions that asymptotically approach the modes of an underlying discrete distribution. Furthermore, our method can be viewed as a multi-source extension to the recently proposed score distillation sampling scheme, shedding additional light on its use beyond conditional sampling. The project webpage is available at https://alpha-rgs.github.io  ( 2 min )
    A Mobile Data-Driven Hierarchical Deep Reinforcement Learning Approach for Real-time Demand-Responsive Railway Rescheduling and Station Overcrowding Mitigation. (arXiv:2308.11849v2 [eess.SY] UPDATED)
    Real-time railway rescheduling is an important technique to enable operational recovery in response to unexpected and dynamic conditions in a timely and flexible manner. Current research relies mostly on OD-based data and model-based methods for estimating train passenger demands. These approaches primarily focus on averaged disruption patterns, often overlooking the immediate uneven distribution of demand over time. In reality, passenger demand deviates significantly from predictions, especially during a disaster. Disasters such as the flood in Zhengzhou, China in 2022 have had an unprecedented effect not only on Zhengzhou railway station itself, which is a major railway hub in China, but also on other major hubs connected to Zhengzhou, e.g., Xi'an, the closest hub west of Zhengzhou. In this study, we define a real-time demand-responsive (RTDR) railway rescheduling problem focusing on two specific aspects, namely, volatility of the demand and management of station crowdedness. For the first time, we propose a data-driven approach using real-time mobile data (MD) to deal with this RTDR problem. A hierarchical deep reinforcement learning (HDRL) framework is designed to perform real-time rescheduling in a demand-responsive manner. The use of MD has enabled the modelling of passenger dynamics in response to train delays and station crowdedness, and a real-time optimisation for rescheduling of train services in view of the change in demand as a result of passengers' behavioural response to disruption. Results show that the agent can steadily satisfy over 62% of the demand with only 61% of the original rolling stock, ensuring continuous operations without overcrowding. Moreover, the agent exhibits adaptability when transferred to a new environment with increased demand, highlighting its effectiveness in addressing unforeseen disruptions in real-time settings.  ( 3 min )
    S-LoRA: Serving Thousands of Concurrent LoRA Adapters. (arXiv:2311.03285v2 [cs.LG] UPDATED)
    The "pretrain-then-finetune" paradigm is commonly adopted in the deployment of large language models. Low-Rank Adaptation (LoRA), a parameter-efficient fine-tuning method, is often employed to adapt a base model to a multitude of tasks, resulting in a substantial collection of LoRA adapters derived from one base model. We observe that this paradigm presents significant opportunities for batched inference during serving. To capitalize on these opportunities, we present S-LoRA, a system designed for the scalable serving of many LoRA adapters. S-LoRA stores all adapters in the main memory and fetches the adapters used by the currently running queries to the GPU memory. To efficiently use the GPU memory and reduce fragmentation, S-LoRA proposes Unified Paging. Unified Paging uses a unified memory pool to manage dynamic adapter weights with different ranks and KV cache tensors with varying sequence lengths. Additionally, S-LoRA employs a novel tensor parallelism strategy and highly optimized custom CUDA kernels for heterogeneous batching of LoRA computation. Collectively, these features enable S-LoRA to serve thousands of LoRA adapters on a single GPU or across multiple GPUs with a small overhead. Compared to state-of-the-art libraries such as HuggingFace PEFT and vLLM (with naive support of LoRA serving), S-LoRA can improve the throughput by up to 4 times and increase the number of served adapters by several orders of magnitude. As a result, S-LoRA enables scalable serving of many task-specific fine-tuned models and offers the potential for large-scale customized fine-tuning services. The code is available at https://github.com/S-LoRA/S-LoRA  ( 3 min )
    MobileNVC: Real-time 1080p Neural Video Compression on a Mobile Device. (arXiv:2310.01258v2 [eess.IV] UPDATED)
    Neural video codecs have recently become competitive with standard codecs such as HEVC in the low-delay setting. However, most neural codecs are large floating-point networks that use pixel-dense warping operations for temporal modeling, making them too computationally expensive for deployment on mobile devices. Recent work has demonstrated that running a neural decoder in real time on mobile is feasible, but shows this only for 720p RGB video. This work presents the first neural video codec that decodes 1080p YUV420 video in real time on a mobile device. Our codec relies on two major contributions. First, we design an efficient codec that uses a block-based motion compensation algorithm available on the warping core of the mobile accelerator, and we show how to quantize this model to integer precision. Second, we implement a fast decoder pipeline that concurrently runs neural network components on the neural signal processor, parallel entropy coding on the mobile GPU, and warping on the warping core. Our codec outperforms the previous on-device codec by a large margin with up to 48% BD-rate savings, while reducing the MAC count on the receiver side by $10 \times$. We perform a careful ablation to demonstrate the effect of the introduced motion compensation scheme, and ablate the effect of model quantization.  ( 3 min )
    The Evolution of the Interplay Between Input Distributions and Linear Regions in Networks. (arXiv:2310.18725v2 [cs.LG] UPDATED)
    It is commonly recognized that the expressiveness of deep neural networks is contingent upon a range of factors, encompassing their depth, width, and other relevant considerations. Currently, the practical performance of the majority of deep neural networks remains uncertain. For ReLU (Rectified Linear Unit) networks with piecewise linear activations, the number of linear convex regions serves as a natural metric to gauge the network's expressivity. In this paper, we count the number of linear convex regions in deep neural networks based on ReLU. In particular, we prove that for any one-dimensional input, there exists a minimum threshold for the number of neurons required to express it. We also empirically observe that for the same network, intricate inputs hinder its capacity to express linear regions. Furthermore, we unveil the iterative refinement process of decision boundaries in ReLU networks during training. We hope our research will serve as an inspiration for network optimization endeavors and aid in the exploration and analysis of the behaviors exhibited by deep networks.  ( 2 min )
    Atom: Low-bit Quantization for Efficient and Accurate LLM Serving. (arXiv:2310.19102v2 [cs.LG] UPDATED)
    The growing demand for Large Language Models (LLMs) in applications such as content generation, intelligent chatbots, and sentiment analysis poses considerable challenges for LLM service providers. To efficiently use GPU resources and boost throughput, batching multiple requests has emerged as a popular paradigm; to further speed up batching, LLM quantization techniques reduce memory consumption and increase computing capacity. However, prevalent quantization schemes (e.g., 8-bit weight-activation quantization) cannot fully leverage the capabilities of modern GPUs, such as 4-bit integer operators, resulting in sub-optimal performance. To maximize LLMs' serving throughput, we introduce Atom, a low-bit quantization method that achieves high throughput improvements with negligible accuracy loss. Atom significantly boosts serving throughput by using low-bit operators and considerably reduces memory consumption via low-bit quantization. It attains high accuracy by applying a novel mixed-precision and fine-grained quantization process. We evaluate Atom on 4-bit weight-activation quantization setups in the serving context. Atom improves end-to-end throughput by up to $7.73\times$ compared to FP16 and by $2.53\times$ compared to INT8 quantization, while maintaining the same latency target.  ( 2 min )
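    To make the phrase "fine-grained quantization" in the abstract above concrete, here is a generic sketch of symmetric 4-bit group-wise quantization in plain Python. It is not Atom's algorithm (the abstract does not give its details, and the mixed-precision part is not modelled at all); the function names, the group size, and the example weights are assumptions chosen for illustration. The idea shown is only that giving each small group its own scale keeps one large outlier from wiping out the precision of the small values elsewhere.

```python
from typing import List, Tuple

Group = Tuple[List[int], float]  # (integer codes, scale) for one group


def quantize_group(values: List[float], bits: int = 4) -> Group:
    """Symmetric quantization of one group to signed `bits`-bit integer codes."""
    qmax = 2 ** (bits - 1) - 1                 # 7 for 4 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def quantize(values: List[float], group_size: int, bits: int = 4) -> List[Group]:
    """Split `values` into consecutive groups and quantize each with its own scale."""
    return [
        quantize_group(values[start:start + group_size], bits)
        for start in range(0, len(values), group_size)
    ]


def dequantize(groups: List[Group]) -> List[float]:
    """Map integer codes back to approximate real values."""
    return [code * scale for codes, scale in groups for code in codes]


if __name__ == "__main__":
    # Hypothetical weights: four small values followed by a group with an outlier.
    weights = [0.1, -0.2, 0.05, 0.7, 10.0, -3.0, 2.0, 0.5]
    fine = dequantize(quantize(weights, group_size=4))
    coarse = dequantize(quantize(weights, group_size=8))
    print("fine-grained:", fine)
    print("single scale:", coarse)
```

    With a single scale, the outlier forces the small weights in the first half of the list to round to zero; with groups of four they keep a scale of their own.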
    User Training with Error Augmentation for Electromyogram-based Gesture Classification. (arXiv:2309.07289v2 [cs.HC] UPDATED)
    We designed and tested a system for real-time control of a user interface by extracting surface electromyographic (sEMG) activity from eight electrodes in a wrist-band configuration. sEMG data were streamed into a machine-learning algorithm that classified hand gestures in real-time. After an initial model calibration, participants were presented with one of three types of feedback during a human-learning stage: veridical feedback, in which predicted probabilities from the gesture classification algorithm were displayed without alteration, modified feedback, in which we applied a hidden augmentation of error to these probabilities, and no feedback. User performance was then evaluated in a series of minigames, in which subjects were required to use eight gestures to manipulate their game avatar to complete a task. Experimental results indicated that, relative to baseline, the modified feedback condition led to significantly improved accuracy and improved gesture class separation. These findings suggest that real-time feedback in a gamified user interface with manipulation of feedback may enable intuitive, rapid, and accurate task acquisition for sEMG-based gesture recognition applications.  ( 2 min )
    DGFN: Double Generative Flow Networks. (arXiv:2310.19685v3 [cs.LG] UPDATED)
    Deep learning is emerging as an effective tool in drug discovery, with potential applications in both predictive and generative models. Generative Flow Networks (GFlowNets/GFNs) are a recently introduced method recognized for the ability to generate diverse candidates, in particular in small molecule generation tasks. In this work, we introduce double GFlowNets (DGFNs). Drawing inspiration from reinforcement learning and Double Deep Q-Learning, we introduce a target network used to sample trajectories, while updating the main network with these sampled trajectories. Empirical results confirm that DGFNs effectively enhance exploration in sparse reward domains and high-dimensional state spaces, both challenging aspects of de-novo design in drug discovery.  ( 2 min )
    Personalizing Keyword Spotting with Speaker Information. (arXiv:2311.03419v1 [eess.AS])
    Keyword spotting systems often struggle to generalize to a diverse population with various accents and age groups. To address this challenge, we propose a novel approach that integrates speaker information into keyword spotting using Feature-wise Linear Modulation (FiLM), a recent method for learning from multiple sources of information. We explore both Text-Dependent and Text-Independent speaker recognition systems to extract speaker information, and we experiment on extracting this information from both the input audio and pre-enrolled user audio. We evaluate our systems on a diverse dataset and achieve a substantial improvement in keyword detection accuracy, particularly among underrepresented speaker groups. Moreover, our proposed approach only requires a small 1% increase in the number of parameters, with a minimum impact on latency and computational cost, which makes it a practical solution for real-world applications.  ( 2 min )
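    For readers unfamiliar with FiLM, the mechanism named in the abstract above is a per-feature scale and shift whose coefficients are computed from a conditioning input, here the speaker information. The sketch below shows that operation in plain Python under stated assumptions: the two linear maps, the parameter names, and the toy dimensions are hypothetical, and nothing here reflects the paper's actual architecture.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def linear(weights: Matrix, bias: Vector, vec: Vector) -> Vector:
    """Plain affine map: one output per row of `weights`."""
    return [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weights, bias)]


def film(features: Vector, speaker_embedding: Vector,
         gamma_w: Matrix, gamma_b: Vector,
         beta_w: Matrix, beta_b: Vector) -> Vector:
    """Feature-wise linear modulation: gamma(speaker) * features + beta(speaker)."""
    gamma = linear(gamma_w, gamma_b, speaker_embedding)  # per-feature scale
    beta = linear(beta_w, beta_b, speaker_embedding)     # per-feature shift
    return [g * x + b for g, x, b in zip(gamma, features, beta)]


if __name__ == "__main__":
    # Toy example: 2 keyword-spotting features modulated by a 2-dim speaker embedding.
    out = film(
        features=[1.0, 2.0],
        speaker_embedding=[0.5, -1.0],
        gamma_w=[[1.0, 0.0], [0.0, 1.0]], gamma_b=[1.0, 1.0],
        beta_w=[[0.0, 0.0], [0.0, 0.0]], beta_b=[0.1, 0.2],
    )
    print(out)
```

    Because only the two small maps producing the scale and shift are added, a layer of this kind contributes few extra parameters, which is consistent with the small overhead the abstract reports.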
    Maximum Mean Discrepancy Meets Neural Networks: The Radon-Kolmogorov-Smirnov Test. (arXiv:2309.02422v3 [stat.ML] UPDATED)
    Maximum mean discrepancy (MMD) refers to a general class of nonparametric two-sample tests that are based on maximizing the mean difference over samples from one distribution $P$ versus another $Q$, over all choices of data transformations $f$ living in some function space $\mathcal{F}$. Inspired by recent work that connects what are known as functions of $\textit{Radon bounded variation}$ (RBV) and neural networks (Parhi and Nowak, 2021, 2023), we study the MMD defined by taking $\mathcal{F}$ to be the unit ball in the RBV space of a given smoothness order $k \geq 0$. This test, which we refer to as the $\textit{Radon-Kolmogorov-Smirnov}$ (RKS) test, can be viewed as a generalization of the well-known and classical Kolmogorov-Smirnov (KS) test to multiple dimensions and higher orders of smoothness. It is also intimately connected to neural networks: we prove that the witness in the RKS test -- the function $f$ achieving the maximum mean difference -- is always a ridge spline of degree $k$, i.e., a single neuron in a neural network. This allows us to leverage the power of modern deep learning toolkits to (approximately) optimize the criterion that underlies the RKS test. We prove that the RKS test has asymptotically full power at distinguishing any distinct pair $P \neq Q$ of distributions, derive its asymptotic null distribution, and carry out extensive experiments to elucidate the strengths and weaknesses of the RKS test versus the more traditional kernel MMD test.  ( 3 min )
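    As a reading aid, the verbal definition in the abstract above can be written out as follows. The symbols $X$, $Y$, $t$, $w$ and $b$ are introduced here for illustration and do not appear in the abstract; the ridge-spline form in the last line is the usual truncated-power parameterization and is an assumption that should be checked against the paper.

```latex
% Population MMD over a function class F, as described in words in the abstract
\mathrm{MMD}(P, Q; \mathcal{F})
  = \sup_{f \in \mathcal{F}}
    \Big( \mathbb{E}_{X \sim P}[f(X)] - \mathbb{E}_{Y \sim Q}[f(Y)] \Big)

% Classical one-dimensional KS statistic, the special case the RKS test generalizes
\mathrm{KS}(P, Q)
  = \sup_{t \in \mathbb{R}}
    \big| \, P(X \le t) - Q(Y \le t) \, \big|

% A ridge spline of degree k, i.e. a single neuron with a truncated-power activation
f(x) = (w^\top x - b)_+^{k}
```

    The KS statistic is recovered from the first line by restricting $f$ to signed indicator functions of half-lines, which is the sense in which a sup-over-functions test generalizes it.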
    How to Scale Your EMA. (arXiv:2307.13813v3 [stat.ML] UPDATED)
    Preserving training dynamics across batch sizes is an important tool for practical machine learning as it enables the trade-off between batch size and wall-clock time. This trade-off is typically enabled by a scaling rule, for example, in stochastic gradient descent, one should scale the learning rate linearly with the batch size. Another important machine learning tool is the model EMA, a functional copy of a target model, whose parameters move towards those of its target model according to an Exponential Moving Average (EMA) at a rate parameterized by a momentum hyperparameter. This model EMA can improve the robustness and generalization of supervised learning, stabilize pseudo-labeling, and provide a learning signal for Self-Supervised Learning (SSL). Prior works have not considered the optimization of the model EMA when performing scaling, leading to different training dynamics across batch sizes and lower model performance. In this work, we provide a scaling rule for optimization in the presence of a model EMA and demonstrate the rule's validity across a range of architectures, optimizers, and data modalities. We also show the rule's validity where the model EMA contributes to the optimization of the target model, enabling us to train EMA-based pseudo-labeling and SSL methods at small and large batch sizes. For SSL, we enable training of BYOL up to batch size 24,576 without sacrificing performance, a 6$\times$ wall-clock time reduction under idealized hardware settings.  ( 3 min )
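    The abstract above states the linear learning-rate rule for SGD but not the EMA rule itself. The sketch below assumes the rule takes the form "raise the momentum to the power of the batch-size factor", which is our reading rather than something stated in the abstract, and uses a property that can be checked directly: for a fixed target, `kappa` small EMA steps with momentum `rho` land exactly where one step with momentum `rho ** kappa` does. All function and variable names are hypothetical.

```python
def ema_step(ema: float, target: float, momentum: float) -> float:
    """One model-EMA update: move `ema` towards `target` at rate (1 - momentum)."""
    return momentum * ema + (1.0 - momentum) * target


def run_ema(ema: float, target: float, momentum: float, steps: int) -> float:
    """Apply `steps` EMA updates against a target that stays fixed."""
    for _ in range(steps):
        ema = ema_step(ema, target, momentum)
    return ema


def scale_for_batch(lr: float, momentum: float, kappa: float):
    """Scale hyperparameters when the batch size is multiplied by `kappa`.

    The learning rate follows the linear rule quoted in the abstract for SGD.
    The momentum rule (momentum ** kappa) is an assumption of this sketch.
    """
    return lr * kappa, momentum ** kappa


if __name__ == "__main__":
    kappa = 4
    lr, rho = scale_for_batch(lr=0.1, momentum=0.99, kappa=kappa)
    small_batch = run_ema(ema=0.0, target=1.0, momentum=0.99, steps=kappa)
    large_batch = run_ema(ema=0.0, target=1.0, momentum=rho, steps=1)
    print(lr, rho, small_batch, large_batch)
```

    In real training the target model moves between steps, so the match is only approximate; the fixed-target case is shown because it isolates why an exponent, rather than a linear factor, is the natural way to rescale the momentum.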
    Dynamics of Finite Width Kernel and Prediction Fluctuations in Mean Field Neural Networks. (arXiv:2304.03408v3 [stat.ML] UPDATED)
    We analyze the dynamics of finite width effects in wide but finite feature learning neural networks. Starting from a dynamical mean field theory description of infinite width deep neural network kernel and prediction dynamics, we provide a characterization of the $O(1/\sqrt{\text{width}})$ fluctuations of the DMFT order parameters over random initializations of the network weights. Our results, while perturbative in width, unlike prior analyses, are non-perturbative in the strength of feature learning. In the lazy limit of network training, all kernels are random but static in time and the prediction variance has a universal form. However, in the rich, feature learning regime, the fluctuations of the kernels and predictions are dynamically coupled with a variance that can be computed self-consistently. In two layer networks, we show how feature learning can dynamically reduce the variance of the final tangent kernel and final network predictions. We also show how initialization variance can slow down online learning in wide but finite networks. In deeper networks, kernel variance can dramatically accumulate through subsequent layers at large feature learning strengths, but feature learning continues to improve the signal-to-noise ratio of the feature kernels. In discrete time, we demonstrate that large learning rate phenomena such as edge of stability effects can be well captured by infinite width dynamics and that initialization variance can decrease dynamically. For CNNs trained on CIFAR-10, we empirically find significant corrections to both the bias and variance of network dynamics due to finite width.  ( 3 min )
    Interaction Measures, Partition Lattices and Kernel Tests for High-Order Interactions. (arXiv:2306.00904v3 [stat.ML] UPDATED)
    Models that rely solely on pairwise relationships often fail to capture the complete statistical structure of the complex multivariate data found in diverse domains, such as socio-economic, ecological, or biomedical systems. Non-trivial dependencies between groups of more than two variables can play a significant role in the analysis and modelling of such systems, yet extracting such high-order interactions from data remains challenging. Here, we introduce a hierarchy of $d$-order ($d \geq 2$) interaction measures, increasingly inclusive of possible factorisations of the joint probability distribution, and define non-parametric, kernel-based tests to establish systematically the statistical significance of $d$-order interactions. We also establish mathematical links with lattice theory, which elucidate the derivation of the interaction measures and their composite permutation tests; clarify the connection of simplicial complexes with kernel matrix centring; and provide a means to enhance computational efficiency. We illustrate our results numerically with validations on synthetic data, and through an application to neuroimaging data.  ( 2 min )
    FAIR4Cov: Fused Audio Instance and Representation for COVID-19 Detection. (arXiv:2204.10581v3 [cs.SD] UPDATED)
    Audio-based classification techniques on body sounds have long been studied to aid in the diagnosis of respiratory diseases. While most research is centered on the use of cough as the main biomarker, other body sounds also have the potential to detect respiratory diseases. Recent studies on COVID-19 have shown that breath and speech sounds, in addition to cough, correlate with the disease. Our study proposes Fused Audio Instance and Representation (FAIR) as a method for respiratory disease detection. FAIR relies on constructing a joint feature vector from various body sounds represented in waveform and spectrogram form. We conducted experiments on the use case of COVID-19 detection by combining waveform and spectrogram representation of body sounds. Our findings show that the use of self-attention to combine extracted features from cough, breath, and speech sounds leads to the best performance with an Area Under the Receiver Operating Characteristic Curve (AUC) score of 0.8658, a sensitivity of 0.8057, and a specificity of 0.7958. Compared to models trained solely on spectrograms or waveforms, the use of both representations results in an improved AUC score, demonstrating that combining spectrogram and waveform representation helps to enrich the extracted features and outperforms the models that use only one representation.  ( 3 min )
    The Sparsity Roofline: Understanding the Hardware Limits of Sparse Neural Networks. (arXiv:2310.00496v2 [cs.CV] UPDATED)
    We introduce the Sparsity Roofline, a visual performance model for evaluating sparsity in neural networks. The Sparsity Roofline jointly models network accuracy, sparsity, and theoretical inference speedup. Our approach does not require implementing and benchmarking optimized kernels, and the theoretical speedup becomes equal to the actual speedup when the corresponding dense and sparse kernels are well-optimized. We achieve this through a novel analytical model for predicting sparse network performance, and validate the predicted speedup using several real-world computer vision architectures pruned across a range of sparsity patterns and degrees. We demonstrate the utility and ease-of-use of our model through two case studies: (1) we show how machine learning researchers can predict the performance of unimplemented or unoptimized block-structured sparsity patterns, and (2) we show how hardware designers can predict the performance implications of new sparsity patterns and sparse data formats in hardware. In both scenarios, the Sparsity Roofline helps performance experts identify sparsity regimes with the highest performance potential.  ( 2 min )
    Deep-learning-based decomposition of overlapping-sparse images: application at the vertex of neutrino interactions. (arXiv:2310.19695v2 [cs.CV] UPDATED)
    Image decomposition plays a crucial role in various computer vision tasks, enabling the analysis and manipulation of visual content at a fundamental level. Overlapping images, which occur when multiple objects or scenes partially occlude each other, pose unique challenges for decomposition algorithms. The task intensifies when working with sparse images, where the scarcity of meaningful information complicates the precise extraction of components. This paper presents a solution that leverages the power of deep learning to accurately extract individual objects within multi-dimensional overlapping-sparse images, with a direct application in high-energy physics with decomposition of overlaid elementary particles obtained from imaging detectors. In particular, the proposed approach tackles a highly complex yet unsolved problem: identifying and measuring independent particles at the vertex of neutrino interactions, where one expects to observe detector images with multiple indiscernible overlapping charged particles. By decomposing the image of the detector activity at the vertex through deep learning, it is possible to infer the kinematic parameters of the identified low-momentum particles - which otherwise would remain neglected - and enhance the reconstructed energy resolution of the neutrino event. We also present an additional step - that can be tuned directly on detector data - combining the above method with a fully-differentiable generative model to improve the image decomposition further and, consequently, the resolution of the measured parameters, achieving unprecedented results. This improvement is crucial for precisely measuring the parameters that govern neutrino flavour oscillations and searching for asymmetries between matter and antimatter.  ( 3 min )
    Holistic Analysis of Hallucination in GPT-4V(ision): Bias and Interference Challenges. (arXiv:2311.03287v2 [cs.LG] UPDATED)
    While GPT-4V(ision) impressively models both visual and textual information simultaneously, its hallucination behavior has not been systematically assessed. To bridge this gap, we introduce a new benchmark, namely, the Bias and Interference Challenges in Visual Language Models (Bingo). This benchmark is designed to evaluate and shed light on the two common types of hallucinations in visual language models: bias and interference. Here, bias refers to the model's tendency to hallucinate certain types of responses, possibly due to imbalance in its training data. Interference pertains to scenarios where the judgment of GPT-4V(ision) can be disrupted due to how the text prompt is phrased or how the input image is presented. We identify a notable regional bias, whereby GPT-4V(ision) is better at interpreting Western images or images with English writing compared to images from other countries or containing text in other languages. Moreover, GPT-4V(ision) is vulnerable to leading questions and is often confused when interpreting multiple images together. Popular mitigation approaches, such as self-correction and chain-of-thought reasoning, are not effective in resolving these challenges. We also identified similar biases and interference vulnerabilities with LLaVA and Bard. Our results characterize the hallucination challenges in GPT-4V(ision) and state-of-the-art visual-language models, and highlight the need for new solutions. The Bingo benchmark is available at https://github.com/gzcch/Bingo.  ( 3 min )
    Improved MDL Estimators Using Fiber Bundle of Local Exponential Families for Non-exponential Families. (arXiv:2311.03852v1 [cs.IT])
    Minimum Description Length (MDL) estimators, using two-part codes for universal coding, are analyzed. For general parametric families under certain regularity conditions, we introduce a two-part code whose regret is close to the minimax regret, where regret of a code with respect to a target family M is the difference between the code length of the code and the ideal code length achieved by an element in M. This is a generalization of the result for exponential families by Grünwald. Our code is constructed by using an augmented structure of M with a bundle of local exponential families for data description, which is not needed for exponential families. This result gives a tight upper bound on risk and loss of the MDL estimators based on the theory introduced by Barron and Cover in 1991. Further, we show that we can apply the result to mixture families, which are a typical example of non-exponential families.  ( 2 min )
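The regret described in the abstract above can be written out as a formula. This is only a sketch: the symbols $L$, $x^n$, and $p$ are notation assumed here for illustration, and "an element in M" is read as the element of M that gives the shortest ideal code length for the observed data.

```latex
% Assumed notation: L(x^n) is the length the code assigns to data x^n,
% and -\log p(x^n) is the ideal code length under an element p of M.
\mathrm{Reg}(L, x^n; M) \;=\; L(x^n) \;-\; \inf_{p \in M} \bigl( -\log p(x^n) \bigr),
\qquad
\text{minimax regret:}\quad \inf_{L} \, \sup_{x^n} \, \mathrm{Reg}(L, x^n; M).
```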
    Learning-Based Latency-Constrained Fronthaul Compression Optimization in C-RAN. (arXiv:2311.03899v1 [cs.NI])
    The evolution of wireless mobile networks towards cloudification, where Radio Access Network (RAN) functions can be hosted at either central or distributed locations, offers many benefits like low cost deployment, higher capacity, and improved hardware utilization. Nevertheless, the flexibility in the functional deployment comes at the cost of stringent fronthaul (FH) capacity and latency requirements. One possible approach to deal with these rigorous constraints is to use FH compression techniques. To ensure that FH capacity and latency requirements are met, more FH compression is applied during high load, while less compression is applied during medium and low load to improve FH utilization and air interface performance. In this paper, a model-free deep reinforcement learning (DRL) based FH compression (DRL-FC) framework is proposed that dynamically controls FH compression through various configuration parameters such as modulation order, precoder granularity, and precoder weight quantization that affect both FH load and air interface performance. Simulation results show that DRL-FC exhibits significantly higher FH utilization (68.7% on average) and air interface throughput than a reference scheme (i.e. with no applied compression) across different FH load levels. At the same time, the proposed DRL-FC framework is able to meet the predefined FH latency constraints (in our case set to 260 $\mu$s) under various FH loads.  ( 2 min )
    CongFu: Conditional Graph Fusion for Drug Synergy Prediction. (arXiv:2305.14517v2 [cs.LG] UPDATED)
    Drug synergy, characterized by the amplified combined effect of multiple drugs, is critically important for optimizing therapeutic outcomes. Limited data on drug synergy, arising from the vast number of possible drug combinations and testing costs, motivate the need for predictive methods. In this work, we introduce CongFu, a novel Conditional Graph Fusion Layer, designed to predict drug synergy. CongFu employs an attention mechanism and a bottleneck to extract local graph contexts and conditionally fuse graph data within a global context. Its modular architecture enables flexible replacement of layer modules, including readouts and graph encoders, facilitating customization for diverse applications. To evaluate the performance of CongFu, we conduct comprehensive experiments on four datasets, encompassing three distinct setups for drug synergy prediction. CongFu achieves state-of-the-art results on 11 out of 12 benchmark datasets, demonstrating its ability to capture intricate patterns of drug synergy. Through ablation studies, we validate the significance of individual layer components, affirming their contributions to overall predictive performance. Finally, we propose an explainability strategy for elucidating the effect of drugs on genes. By addressing the challenge of predicting drug synergy in untested drug pairs and utilizing our proposed explainability approach, CongFu opens new avenues for optimizing drug combinations and advancing personalized medicine.
    MAGNet: Motif-Agnostic Generation of Molecules from Shapes. (arXiv:2305.19303v2 [physics.chem-ph] UPDATED)
    Recent advances in machine learning for molecules exhibit great potential for facilitating drug discovery from in silico predictions. Most models for molecule generation rely on the decomposition of molecules into frequently occurring substructures (motifs), from which they generate novel compounds. While motif representations greatly aid in learning molecular distributions, such methods struggle to represent substructures beyond their known motif set. To alleviate this issue and increase flexibility across datasets, we propose MAGNet, a graph-based model that generates abstract shapes before allocating atom and bond types. To this end, we introduce a novel factorisation of the molecules' data distribution that accounts for the molecules' global context and facilitates learning adequate assignments of atoms and bonds onto shapes. Despite the added complexity of shape abstractions, MAGNet outperforms most other graph-based approaches on standard benchmarks. Importantly, we demonstrate that MAGNet's improved expressivity leads to molecules with more topologically distinct structures and, at the same time, diverse atom and bond assignments.
    Analysis of the User Perception of Chatbots in Education Using A Partial Least Squares Structural Equation Modeling Approach. (arXiv:2311.03636v1 [cs.HC])
    The integration of Artificial Intelligence (AI) into education is a recent development, with chatbots emerging as a noteworthy addition to this transformative landscape. As online learning platforms rapidly advance, students need to adapt swiftly to excel in this dynamic environment. Consequently, understanding the acceptance of chatbots, particularly those employing Large Language Models (LLMs) such as Chat Generative Pretrained Transformer (ChatGPT), Google Bard, and other interactive AI technologies, is of paramount importance. However, existing research on chatbots in education has overlooked key behavior-related aspects, such as Optimism, Innovativeness, Discomfort, Insecurity, Transparency, Ethics, Interaction, Engagement, and Accuracy, creating a significant literature gap. To address this gap, this study employs Partial Least Squares Structural Equation Modeling (PLS-SEM) to investigate the determinants of chatbot adoption in education among students, considering the Technology Readiness Index (TRI) and Technology Acceptance Model (TAM). Utilizing a five-point Likert scale for data collection, we gathered a total of 185 responses, which were analyzed using R-Studio software. We established 12 hypotheses to achieve the study's objectives. The results showed that Optimism and Innovativeness are positively associated with Perceived Ease of Use (PEOU) and Perceived Usefulness (PU). Conversely, Discomfort and Insecurity negatively impact PEOU, with only Insecurity negatively affecting PU. These findings provide insights for future technology designers, elucidating critical user behavior factors influencing chatbot adoption and utilization in educational contexts.
    Investigating the Role of Attribute Context in Vision-Language Models for Object Recognition and Detection. (arXiv:2303.10093v2 [cs.CV] UPDATED)
    Vision-language alignment learned from image-caption pairs has been shown to benefit tasks like object recognition and detection. Methods are mostly evaluated in terms of how well object class names are learned, but captions also contain rich attribute context that should be considered when learning object alignment. It is unclear how methods use this context in learning, as well as whether models succeed when tasks require attribute and object understanding. To address this gap, we conduct extensive analysis of the role of attributes in vision-language models. We specifically measure model sensitivity to the presence and meaning of attribute context, gauging influence on object embeddings through unsupervised phrase grounding and classification via description methods. We further evaluate the utility of attribute context in training for open-vocabulary object detection, fine-grained text-region retrieval, and attribution tasks. Our results show that attribute context can be wasted when learning alignment for detection, attribute meaning is not adequately considered in embeddings, and describing classes by only their attributes is ineffective. A viable strategy that we find to increase benefits from attributes is contrastive training with adjective-based negative captions.
    How Many Neurons Does it Take to Approximate the Maximum? (arXiv:2307.09212v2 [cs.LG] UPDATED)
    We study the size of a neural network needed to approximate the maximum function over $d$ inputs, in the most basic setting of approximating with respect to the $L_2$ norm, for continuous distributions, for a network that uses ReLU activations. We provide new lower and upper bounds on the width required for approximation across various depths. Our results establish new depth separations between depth 2 and 3, and depth 3 and 5 networks, as well as providing a depth $\mathcal{O}(\log(\log(d)))$ and width $\mathcal{O}(d)$ construction which approximates the maximum function. Our depth separation results are facilitated by a new lower bound for depth 2 networks approximating the maximum function over the uniform distribution, assuming an exponential upper bound on the size of the weights. Furthermore, we are able to use this depth 2 lower bound to provide tight bounds on the number of neurons needed to approximate the maximum by a depth 3 network. Our lower bounds are of potentially broad interest as they apply to the widely studied and used max function, in contrast to many previous results that base their bounds on specially constructed or pathological functions and distributions.
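For context on why ReLU networks can represent the maximum at all, the sketch below uses the standard identity max(a, b) = b + ReLU(a - b) in a pairwise tournament. This is the textbook construction with roughly log2(d) rounds, not the depth $\mathcal{O}(\log(\log(d)))$ construction from the abstract; the function names are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence, Tuple


def relu(x: float) -> float:
    """Rectified linear unit."""
    return x if x > 0.0 else 0.0


def relu_max2(a: float, b: float) -> float:
    """Maximum of two numbers using a single ReLU: max(a, b) = b + ReLU(a - b)."""
    return b + relu(a - b)


def relu_max(values: Sequence[float]) -> Tuple[float, int]:
    """Return (maximum, number of tournament rounds) for a non-empty sequence.

    Each round halves the number of candidates, so the number of rounds is
    ceil(log2(d)) for d inputs. Results are exact up to floating-point rounding.
    """
    if not values:
        raise ValueError("values must be non-empty")
    layer: List[float] = list(values)
    rounds = 0
    while len(layer) > 1:
        # Compare neighbours pairwise; an unpaired last element passes through.
        nxt = [relu_max2(layer[i], layer[i + 1]) for i in range(0, len(layer) - 1, 2)]
        if len(layer) % 2 == 1:
            nxt.append(layer[-1])
        layer = nxt
        rounds += 1
    return layer[0], rounds


if __name__ == "__main__":
    print(relu_max([3.0, -1.0, 7.5, 2.0, 0.0]))
```

Each round corresponds to one ReLU layer (plus a linear recombination), which is why this naive approach needs depth that grows logarithmically in $d$.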
    Optimizing Solution-Samplers for Combinatorial Problems: The Landscape of Policy-Gradient Methods. (arXiv:2310.05309v2 [cs.LG] UPDATED)
    Deep Neural Networks and Reinforcement Learning methods have empirically shown great promise in tackling challenging combinatorial problems. In those methods a deep neural network is used as a solution generator which is then trained by gradient-based methods (e.g., policy gradient) to successively obtain better solution distributions. In this work we introduce a novel theoretical framework for analyzing the effectiveness of such methods. We ask whether there exist generative models that (i) are expressive enough to generate approximately optimal solutions; (ii) have a tractable, i.e., polynomial in the size of the input, number of parameters; (iii) have an optimization landscape that is benign in the sense that it does not contain sub-optimal stationary points. Our main contribution is a positive answer to this question. Our result holds for a broad class of combinatorial problems including Max- and Min-Cut, Max-$k$-CSP, Maximum-Weight-Bipartite-Matching, and the Traveling Salesman Problem. As a byproduct of our analysis we introduce a novel regularization process over vanilla gradient descent and provide theoretical and experimental evidence that it helps address vanishing-gradient issues and escape bad stationary points.
    Minimum Width for Deep, Narrow MLP: A Diffeomorphism Approach. (arXiv:2308.15873v2 [cs.LG] UPDATED)
    Recently, there has been a growing focus on determining the minimum width requirements for achieving the universal approximation property in deep, narrow Multi-Layer Perceptrons (MLPs). Among these challenges, one particularly challenging task is approximating a continuous function under the uniform norm, as indicated by the significant disparity between its lower and upper bounds. To address this problem, we propose a framework that simplifies finding the minimum width for deep, narrow MLPs into determining a purely geometrical function denoted as $w(d_x, d_y)$. This function relies solely on the input and output dimensions, represented as $d_x$ and $d_y$, respectively. Two key steps support this framework. First, we demonstrate that deep, narrow MLPs, when provided with a small additional width, can approximate a $C^2$-diffeomorphism. Subsequently, using this result, we prove that $w(d_x, d_y)$ equates to the optimal minimum width required for deep, narrow MLPs to achieve universality. By employing the aforementioned framework and the Whitney embedding theorem, we provide an upper bound for the minimum width, given by $\operatorname{max}(2d_x+1, d_y) + \alpha(\sigma)$, where $0 \leq \alpha(\sigma) \leq 2$ represents a constant depending on the activation function. Furthermore, we provide a lower bound of $4$ for the minimum width in cases where the input and output dimensions are both equal to two.
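The upper bound quoted in the abstract above, $\max(2d_x+1, d_y) + \alpha(\sigma)$, is simple enough to evaluate directly. The helper below is a minimal sketch: the function name is hypothetical, and treating $\alpha(\sigma)$ as an integer supplied by the caller is an assumption, since the abstract only states that it lies between 0 and 2 and depends on the activation function.

```python
def min_width_upper_bound(d_x: int, d_y: int, alpha: int) -> int:
    """Evaluate max(2*d_x + 1, d_y) + alpha from the abstract.

    d_x, d_y: input and output dimensions (positive integers).
    alpha:    activation-dependent constant, assumed here to be an
              integer with 0 <= alpha <= 2.
    """
    if d_x < 1 or d_y < 1:
        raise ValueError("dimensions must be positive")
    if not 0 <= alpha <= 2:
        raise ValueError("alpha must satisfy 0 <= alpha <= 2")
    return max(2 * d_x + 1, d_y) + alpha


if __name__ == "__main__":
    # For d_x = d_y = 2 the bound ranges from 5 to 7 depending on alpha.
    print([min_width_upper_bound(2, 2, a) for a in (0, 1, 2)])
```

For $d_x = d_y = 2$ this gives an upper bound between 5 and 7, which sits above the lower bound of 4 stated in the abstract for the same dimensions.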
    Teaching Language Models to Hallucinate Less with Synthetic Tasks. (arXiv:2310.06827v3 [cs.CL] UPDATED)
    Large language models (LLMs) frequently hallucinate on abstractive summarization tasks such as document-based question-answering, meeting summarization, and clinical report generation, even though all necessary information is included in context. However, optimizing LLMs to hallucinate less on these tasks is challenging, as hallucination is hard to efficiently evaluate at each optimization step. In this work, we show that reducing hallucination on a synthetic task can also reduce hallucination on real-world downstream tasks. Our method, SynTra, first designs a synthetic task where hallucinations are easy to elicit and measure. It next optimizes the LLM's system message via prefix-tuning on the synthetic task, and finally transfers the system message to realistic, hard-to-optimize tasks. Across three realistic abstractive summarization tasks, SynTra reduces hallucination for two 13B-parameter LLMs using only a synthetic retrieval task for supervision. We also find that optimizing the system message rather than the model weights can be critical; fine-tuning the entire model on the synthetic task can counterintuitively increase hallucination. Overall, SynTra demonstrates that the extra flexibility of working with synthetic data can help mitigate undesired behaviors in practice.
    FAIRO: Fairness-aware Adaptation in Sequential-Decision Making for Human-in-the-Loop Systems. (arXiv:2307.05857v2 [cs.LG] UPDATED)
    Achieving fairness in sequential-decision making systems within Human-in-the-Loop (HITL) environments is a critical concern, especially when multiple humans with different behavior and expectations are affected by the same adaptation decisions in the system. This human variability factor adds more complexity since policies deemed fair at one point in time may become discriminatory over time due to variations in human preferences resulting from inter- and intra-human variability. This paper addresses the fairness problem from an equity lens, considering human behavior variability, and the changes in human preferences over time. We propose FAIRO, a novel algorithm for fairness-aware sequential-decision making in HITL adaptation, which incorporates these notions into the decision-making process. In particular, FAIRO decomposes this complex fairness task into adaptive sub-tasks based on individual human preferences through leveraging the Options reinforcement learning framework. We design FAIRO to generalize to three types of HITL application setups that have the shared adaptation decision problem. Furthermore, we recognize that fairness-aware policies can sometimes conflict with the application's utility. To address this challenge, we provide a fairness-utility tradeoff in FAIRO, allowing system designers to balance the objectives of fairness and utility based on specific application requirements. Extensive evaluations of FAIRO on the three HITL applications demonstrate its generalizability and effectiveness in promoting fairness while accounting for human variability. On average, FAIRO can improve fairness compared with other methods across all three applications by 35.36%.
    Leveraging Deep Learning for Abstractive Code Summarization of Unofficial Documentation. (arXiv:2310.15015v2 [cs.SE] UPDATED)
    Usually, programming languages have official documentation to guide developers with APIs, methods, and classes. However, researchers identified insufficient or inadequate documentation examples and flaws with the API's complex structure as barriers to learning an API. As a result, developers may consult other sources (StackOverflow, GitHub, etc.) to learn more about an API. Recent research studies have shown that unofficial documentation is a valuable source of information for generating code summaries. We, therefore, have been motivated to leverage this type of documentation along with deep learning techniques towards generating high-quality summaries for APIs discussed in informal documentation. This paper proposes an automatic approach using the BART algorithm, a state-of-the-art transformer model, to generate summaries for APIs discussed in StackOverflow. We built an oracle of human-generated summaries and evaluated our approach against it using the ROUGE and BLEU metrics, which are the most widely used evaluation metrics in text summarization. Furthermore, we evaluated our summaries empirically against a previous work in terms of quality. Our findings demonstrate that using deep learning algorithms can improve summaries' quality and outperform the previous work by an average of 57% for Precision, 66% for Recall, and 61% for F-measure, and it runs 4.4 times faster.
    ProtoDiff: Learning to Learn Prototypical Networks by Task-Guided Diffusion. (arXiv:2306.14770v2 [cs.LG] UPDATED)
    Prototype-based meta-learning has emerged as a powerful technique for addressing few-shot learning challenges. However, estimating a deterministic prototype using a simple average function from a limited number of examples remains a fragile process. To overcome this limitation, we introduce ProtoDiff, a novel framework that leverages a task-guided diffusion model during the meta-training phase to gradually generate prototypes, thereby providing efficient class representations. Specifically, a set of prototypes is optimized to achieve per-task prototype overfitting, so that the overfitted prototypes for individual tasks can be obtained accurately. Furthermore, we introduce a task-guided diffusion process within the prototype space, enabling the meta-learning of a generative process that transitions from a vanilla prototype to an overfitted prototype. ProtoDiff gradually generates task-specific prototypes from random noise during the meta-test stage, conditioned on the limited samples available for the new task. Furthermore, to expedite training and enhance ProtoDiff's performance, we propose the utilization of residual prototype learning, which leverages the sparsity of the residual prototype. We conduct thorough ablation studies to demonstrate its ability to accurately capture the underlying prototype distribution and enhance generalization. The new state-of-the-art performance on within-domain, cross-domain, and few-task few-shot classification further substantiates the benefit of ProtoDiff.
    Offline Multi-Agent Reinforcement Learning with Implicit Global-to-Local Value Regularization. (arXiv:2307.11620v2 [cs.LG] UPDATED)
    Offline reinforcement learning (RL) has received considerable attention in recent years due to its attractive capability of learning policies from offline datasets without environmental interactions. Despite some success in the single-agent setting, offline multi-agent RL (MARL) remains a challenge. The large joint state-action space and the coupled multi-agent behaviors pose extra complexities for offline policy optimization. Most existing offline MARL studies simply apply offline data-related regularizations on individual agents, without fully considering the multi-agent system at the global level. In this work, we present OMIGA, a new offline multi-agent RL algorithm with implicit global-to-local value regularization. OMIGA provides a principled framework to convert global-level value regularization into equivalent implicit local value regularizations and simultaneously enables in-sample learning, thus elegantly bridging multi-agent value decomposition and policy learning with offline regularizations. Based on comprehensive experiments on the offline multi-agent MuJoCo and StarCraft II micro-management tasks, we show that OMIGA achieves superior performance over the state-of-the-art offline MARL methods in almost all tasks.
    Exact Bayesian Inference on Discrete Models via Probability Generating Functions: A Probabilistic Programming Approach. (arXiv:2305.17058v3 [cs.PL] UPDATED)
    We present an exact Bayesian inference method for discrete statistical models, which can find exact solutions to a large class of discrete inference problems, even with infinite support and continuous priors. To express such models, we introduce a probabilistic programming language that supports discrete and continuous sampling, discrete observations, affine functions, (stochastic) branching, and conditioning on discrete events. Our key tool is probability generating functions: they provide a compact closed-form representation of distributions that are definable by programs, thus enabling the exact computation of posterior probabilities, expectation, variance, and higher moments. Our inference method is provably correct and fully automated in a tool called Genfer, which uses automatic differentiation (specifically, Taylor polynomials), but does not require computer algebra. Our experiments show that Genfer is often faster than the existing exact inference tools PSI, Dice, and Prodigy. On a range of real-world inference problems that none of these exact tools can solve, Genfer's performance is competitive with approximate Monte Carlo methods, while avoiding approximation errors.
    Learning Proposals for Practical Energy-Based Regression. (arXiv:2110.11948v2 [cs.LG] UPDATED)
    Energy-based models (EBMs) have experienced a resurgence within machine learning in recent years, including as a promising alternative for probabilistic regression. However, energy-based regression requires a proposal distribution to be manually designed for training, and an initial estimate has to be provided at test-time. We address both of these issues by introducing a conceptually simple method to automatically learn an effective proposal distribution, which is parameterized by a separate network head. To this end, we derive a surprising result, leading to a unified training objective that jointly minimizes the KL divergence from the proposal to the EBM, and the negative log-likelihood of the EBM. At test-time, we can then employ importance sampling with the trained proposal to efficiently evaluate the learned EBM and produce stand-alone predictions. Furthermore, we utilize our derived training objective to learn mixture density networks (MDNs) with a jointly trained energy-based teacher, consistently outperforming conventional MDN training on four real-world regression tasks within computer vision. Code is available at https://github.com/fregu856/ebms_proposals.
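The test-time step mentioned in the abstract above, importance sampling with a proposal to evaluate an EBM, can be illustrated with self-normalized importance sampling. This is a toy sketch under stated assumptions, not the authors' implementation: the function `log_unnormalized_density` is a hypothetical stand-in for a learned EBM at one fixed input (chosen so the true answer is known), and a fixed Gaussian stands in for the trained proposal.

```python
import math
import random


def log_unnormalized_density(y: float) -> float:
    """Hypothetical stand-in for a learned EBM score at a fixed input x.

    Chosen as the unnormalized log-density of N(2, 1) so that the
    true mean of the target distribution (2.0) is known.
    """
    return -0.5 * (y - 2.0) ** 2


def proposal_logpdf(y: float, mean: float, std: float) -> float:
    """Log-density of the Gaussian proposal N(mean, std^2)."""
    return -0.5 * ((y - mean) / std) ** 2 - math.log(std * math.sqrt(2.0 * math.pi))


def snis_mean(num_samples: int = 20000, seed: int = 0,
              mean: float = 0.0, std: float = 2.0) -> float:
    """Estimate E[y] under the EBM by self-normalized importance sampling."""
    rng = random.Random(seed)
    samples = [rng.gauss(mean, std) for _ in range(num_samples)]
    # Log importance weights: unnormalized target minus proposal.
    log_w = [log_unnormalized_density(y) - proposal_logpdf(y, mean, std) for y in samples]
    # Subtract the maximum before exponentiating for numerical stability.
    shift = max(log_w)
    weights = [math.exp(lw - shift) for lw in log_w]
    total = sum(weights)
    # Normalizing by the weight sum cancels the unknown partition function.
    return sum(w * y for w, y in zip(weights, samples)) / total


if __name__ == "__main__":
    print(snis_mean())  # should be close to the true mean of 2.0
```

Because the weights are normalized by their own sum, the EBM's partition function never needs to be computed, which is what makes a good proposal sufficient for producing a prediction.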
    An enrichment approach for enhancing the expressivity of neural operators with applications to seismology. (arXiv:2306.04096v2 [cs.LG] UPDATED)
    The Eikonal equation plays a central role in seismic wave propagation and hypocenter localization, a crucial aspect of efficient earthquake early warning systems. Despite recent progress, real-time earthquake localization remains challenging due to the need to learn a generalizable Eikonal operator. We introduce a novel deep learning architecture, Enriched-DeepONet (En-DeepONet), addressing the limitations of current operator learning models in dealing with moving-solution operators. Leveraging addition and subtraction operations and a novel 'root' network, En-DeepONet is particularly suitable for learning such operators and achieves up to four orders of magnitude improved accuracy without increased training cost. We demonstrate the effectiveness of En-DeepONet in earthquake localization under variable velocity and arrival time conditions. Our results indicate that En-DeepONet paves the way for real-time hypocenter localization for velocity models of practical interest. The proposed method represents a significant advancement in operator learning that is applicable to a gamut of scientific problems, including those in seismology, fracture mechanics, and phase-field problems.
    The Impact of Positional Encoding on Length Generalization in Transformers. (arXiv:2305.19466v2 [cs.CL] UPDATED)
    Length generalization, the ability to generalize from small training context sizes to larger ones, is a critical challenge in the development of Transformer-based language models. Positional encoding (PE) has been identified as a major factor influencing length generalization, but the exact impact of different PE schemes on extrapolation in downstream tasks remains unclear. In this paper, we conduct a systematic empirical study comparing the length generalization performance of decoder-only Transformers with five different position encoding approaches including Absolute Position Embedding (APE), T5's Relative PE, ALiBi, and Rotary, in addition to Transformers without positional encoding (NoPE). Our evaluation encompasses a battery of reasoning and mathematical tasks. Our findings reveal that the most commonly used positional encoding methods, such as ALiBi, Rotary, and APE, are not well suited for length generalization in downstream tasks. More importantly, NoPE outperforms other explicit positional encoding methods while requiring no additional computation. We theoretically demonstrate that NoPE can represent both absolute and relative PEs, but when trained with SGD, it mostly resembles T5's relative PE attention patterns. Finally, we find that scratchpad is not always helpful to solve length generalization and its format highly impacts the model's performance. Overall, our work suggests that explicit position embeddings are not essential for decoder-only Transformers to generalize well to longer sequences.
    Neuro-Symbolic Causal Reasoning Meets Signaling Game for Emergent Semantic Communications. (arXiv:2210.12040v2 [cs.LG] UPDATED)
    Semantic communication (SC) aims to communicate reliably with minimal data transfer while simultaneously providing seamless connectivity to heterogeneous services and users. In this paper, a novel emergent SC (ESC) system framework is proposed and is composed of a signaling game for emergent language design and a neuro-symbolic (NeSy) artificial intelligence (AI) approach for causal reasoning. In order to design the language, the signaling game is solved using an alternating maximization between the communicating nodes' utilities. The emergent language helps create a context-aware transmit vocabulary (minimal semantic representation) and aids the reasoning process (enabling generalization to unseen scenarios) by splitting complex messages into simpler reasoning tasks for the receiver. The causal description at the transmitter is then modeled (a neural component) as a posterior distribution of the relevant attributes present in the data. Using the reconstructed causal state, the receiver evaluates a set of logical formulas (symbolic part) to execute its task. The nodes' NeSy reasoning components are implemented by the recently proposed AI tool called Generative Flow Networks, and they are optimized for higher semantic reliability. The ESC system is designed to enhance the novel metrics of semantic information, reliability, distortion and similarity that are designed using rigorous algebraic properties from category theory, thereby generalizing the metrics beyond Shannon's notion of uncertainty. Simulation results validate the ability of ESC to communicate efficiently (with reduced bits) and achieve better semantic reliability than conventional wireless and state-of-the-art systems that do not exploit causal reasoning capabilities.
    The Fairness Stitch: Unveiling the Potential of Model Stitching in Neural Network De-Biasing. (arXiv:2311.03532v1 [cs.LG])
    The pursuit of fairness in machine learning models has emerged as a critical research challenge in different applications ranging from bank loan approval to face detection. Despite the widespread adoption of artificial intelligence algorithms across various domains, concerns persist regarding the presence of biases and discrimination within these models. To address this pressing issue, this study introduces a novel method called "The Fairness Stitch (TFS)" to enhance fairness in deep learning models. This method combines model stitching and training jointly, while incorporating fairness constraints. In this research, we assess the effectiveness of our proposed method by conducting a comprehensive evaluation of two well-known datasets, CelebA and UTKFace. We systematically compare the performance of our approach with the existing baseline method. Our findings reveal a notable improvement in achieving a balanced trade-off between fairness and performance, highlighting the promising potential of our method to address bias-related challenges and foster equitable outcomes in machine learning models. This paper poses a challenge to the conventional wisdom of the effectiveness of the last layer in deep learning models for de-biasing.
    Large Language Models as Superpositions of Cultural Perspectives. (arXiv:2307.07870v3 [cs.CL] UPDATED)
    Large Language Models (LLMs) are often misleadingly recognized as having a personality or a set of values. We argue that an LLM can be seen as a superposition of perspectives with different values and personality traits. LLMs exhibit context-dependent values and personality traits that change based on the induced perspective (as opposed to humans, who tend to have more coherent values and personality traits across contexts). We introduce the concept of perspective controllability, which refers to a model's affordance to adopt various perspectives with differing values and personality traits. In our experiments, we use questionnaires from psychology (PVQ, VSM, IPIP) to study how exhibited values and personality traits change based on different perspectives. Through qualitative experiments, we show that LLMs express different values when those are (implicitly or explicitly) implied in the prompt, and that LLMs express different values even when those are not obviously implied (demonstrating their context-dependent nature). We then conduct quantitative experiments to study the controllability of different models (GPT-4, GPT-3.5, OpenAssistant, StableVicuna, StableLM), the effectiveness of various methods for inducing perspectives, and the smoothness of the models' drivability. We conclude by examining the broader implications of our work and outline a variety of associated scientific questions. The project website is available at https://sites.google.com/view/llm-superpositions .
    Learning Probabilistic Symmetrization for Architecture Agnostic Equivariance. (arXiv:2306.02866v2 [cs.LG] UPDATED)
    We present a novel framework to overcome the limitations of equivariant architectures in learning functions with group symmetries. In contrast to equivariant architectures, we use an arbitrary base model such as an MLP or a transformer and symmetrize it to be equivariant to the given group by employing a small equivariant network that parameterizes the probabilistic distribution underlying the symmetrization. The distribution is end-to-end trained with the base model which can maximize performance while reducing sample complexity of symmetrization. We show that this approach ensures not only equivariance to the given group but also universal approximation capability in expectation. We implement our method on various base models, including patch-based transformers that can be initialized from pretrained vision transformers, and test them for a wide range of symmetry groups including permutation and Euclidean groups and their combinations. Empirical tests show competitive results against tailored equivariant architectures, suggesting the potential for learning equivariant functions for diverse groups using a non-equivariant universal base architecture. We further show evidence of enhanced learning in symmetric modalities, like graphs, when pretrained from non-symmetric modalities, like vision. Code is available at https://github.com/jw9730/lps.
    A Physics-Guided Bi-Fidelity Fourier-Featured Operator Learning Framework for Predicting Time Evolution of Drag and Lift Coefficients. (arXiv:2311.03639v1 [cs.LG])
    In the pursuit of accurate experimental and computational data while minimizing effort, there is a constant need for high-fidelity results. However, achieving such results often requires significant computational resources. To address this challenge, this paper proposes a deep operator learning-based framework that requires a limited high-fidelity dataset for training. We introduce a novel physics-guided, bi-fidelity, Fourier-featured Deep Operator Network (DeepONet) framework that effectively combines low and high-fidelity datasets, leveraging the strengths of each. In our methodology, we began by designing a physics-guided Fourier-featured DeepONet, drawing inspiration from the intrinsic physical behavior of the target solution. Subsequently, we train this network to primarily learn the low-fidelity solution, utilizing an extensive dataset. This process ensures a comprehensive grasp of the foundational solution patterns. Following this foundational learning, the low-fidelity deep operator network's output is enhanced using a physics-guided Fourier-featured residual deep operator network. This network refines the initial low-fidelity output, achieving the high-fidelity solution by employing a small high-fidelity dataset for training. Notably, in our framework, we employ the Fourier feature network as the Trunk network for the DeepONets, given its proficiency in capturing and learning the oscillatory nature of the target solution with high precision. We validate our approach using a well-known 2D benchmark cylinder problem, which aims to predict the time trajectories of lift and drag coefficients. The results highlight that the physics-guided Fourier-featured deep operator network, serving as a foundational building block of our framework, possesses superior predictive capability for the lift and drag coefficients compared to its data-driven counterparts.  ( 3 min )
    PINNs error estimates for nonlinear equations in $\mathbb{R}$-smooth Banach spaces. (arXiv:2305.11915v2 [math.FA] UPDATED)
    In the paper, we describe in operator form classes of PDEs that admit PINN's error estimation. Also, for $L^p$ spaces, we obtain a Bramble-Hilbert type lemma that is a tool for PINN's residuals bounding.
    Learning Disentangled Speech Representations. (arXiv:2311.03389v1 [eess.AS])
    Disentangled representation learning from speech remains limited despite its importance in many application domains. A key challenge is the lack of speech datasets with known generative factors to evaluate methods. This paper proposes SynSpeech: a novel synthetic speech dataset with ground truth factors enabling research on disentangling speech representations. We plan to present a comprehensive study evaluating supervised techniques using established supervised disentanglement metrics. This benchmark dataset and framework address the gap in the rigorous evaluation of state-of-the-art disentangled speech representation learning methods. Our findings will provide insights to advance this underexplored area and enable more robust speech representations.  ( 2 min )
    Graph Construction using Principal Axis Trees for Simple Graph Convolution. (arXiv:2302.12000v3 [cs.LG] UPDATED)
    Graph Neural Networks (GNNs) are increasingly becoming the favorite method for graph learning. They exploit the semi-supervised nature of deep learning, and they bypass computational bottlenecks associated with traditional graph learning methods. In addition to the feature matrix $X$, GNNs need an adjacency matrix $A$ to perform feature propagation. In many cases, the adjacency matrix $A$ is missing. We introduce a graph construction scheme that constructs the adjacency matrix $A$ using unsupervised and supervised information. Unsupervised information characterizes the neighborhood around points. We used Principal Axis trees (PA-trees) as a source for unsupervised information, where we create edges between points falling onto the same leaf node. For supervised information, we used the concept of penalty and intrinsic graphs. A penalty graph connects points with different class labels, whereas an intrinsic graph connects points with the same class labels. We used the penalty and intrinsic graphs to remove or add edges to the graph constructed via PA-tree. We tested this graph construction scheme on two well-known GNNs: 1) Graph Convolutional Network (GCN) and 2) Simple Graph Convolution (SGC). The experiments show that it is better to use SGC because it is faster and delivers better or the same results as GCN. We also test the effect of oversmoothing on both GCN and SGC. We found out that the level of smoothing has to be carefully selected for SGC to avoid oversmoothing.  ( 3 min )
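    The two-stage graph construction described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the leaf assignments are assumed to be given (the PA-tree itself is not built here), the names `build_edges`, `leaf_ids`, and `labels` are hypothetical, and the rule "intrinsic graph adds edges, penalty graph removes edges" is an assumption about how the supervised step is applied.

```python
from itertools import combinations


def build_edges(leaf_ids, labels):
    """Return a set of undirected edges (i, j) with i < j.

    leaf_ids[i] is the leaf that point i falls into (from any tree partition).
    labels[i] is a class label, or None when point i is unlabeled.
    """
    n = len(leaf_ids)
    edges = set()

    # Unsupervised step: connect points that fall onto the same leaf.
    for i, j in combinations(range(n), 2):
        if leaf_ids[i] == leaf_ids[j]:
            edges.add((i, j))

    # Supervised step, applied only to pairs where both labels are known.
    for i, j in combinations(range(n), 2):
        if labels[i] is None or labels[j] is None:
            continue
        if labels[i] == labels[j]:
            edges.add((i, j))      # intrinsic graph: same class, add edge
        else:
            edges.discard((i, j))  # penalty graph: different class, remove edge

    return edges


leaf_ids = [0, 0, 1, 1, 1]
labels = ["a", "b", None, "a", None]
edges = build_edges(leaf_ids, labels)
print(sorted(edges))
```

    In this toy input, points 0 and 1 share a leaf but carry different labels, so their edge is removed, while points 0 and 3 share a label and gain an edge despite sitting in different leaves.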
    Efficient Approximations of Complete Interatomic Potentials for Crystal Property Prediction. (arXiv:2306.10045v9 [physics.chem-ph] UPDATED)
    We study property prediction for crystal materials. A crystal structure consists of a minimal unit cell that is repeated infinitely in 3D space. How to accurately represent such repetitive structures in machine learning models remains unresolved. Current methods construct graphs by establishing edges only between nearby nodes, thereby failing to faithfully capture infinite repeating patterns and distant interatomic interactions. In this work, we propose several innovations to overcome these limitations. First, we propose to model physics-principled interatomic potentials directly instead of only using distances as in many existing methods. These potentials include the Coulomb potential, London dispersion potential, and Pauli repulsion potential. Second, we model the complete set of potentials among all atoms, instead of only between nearby atoms as in existing methods. This is enabled by our approximations of infinite potential summations, where we extend the Ewald summation for several potential series approximations with provable error bounds. Finally, we propose to incorporate our computations of complete interatomic potentials into message passing neural networks for representation learning. We perform experiments on the JARVIS and Materials Project benchmarks for evaluation. Results show that the use of interatomic potentials and complete interatomic potentials leads to consistent performance improvements with reasonable computational costs. Our code is publicly available as part of the AIRS library (https://github.com/divelab/AIRS/tree/main/OpenMat/PotNet).
    How Reliable is Your Regression Model's Uncertainty Under Real-World Distribution Shifts?. (arXiv:2302.03679v2 [cs.LG] UPDATED)
    Many important computer vision applications are naturally formulated as regression problems. Within medical imaging, accurate regression models have the potential to automate various tasks, helping to lower costs and improve patient outcomes. Such safety-critical deployment does however require reliable estimation of model uncertainty, also under the wide variety of distribution shifts that might be encountered in practice. Motivated by this, we set out to investigate the reliability of regression uncertainty estimation methods under various real-world distribution shifts. To that end, we propose an extensive benchmark of 8 image-based regression datasets with different types of challenging distribution shifts. We then employ our benchmark to evaluate many of the most common uncertainty estimation methods, as well as two state-of-the-art uncertainty scores from the task of out-of-distribution detection. We find that while methods are well calibrated when there is no distribution shift, they all become highly overconfident on many of the benchmark datasets. This uncovers important limitations of current uncertainty estimation methods, and the proposed benchmark therefore serves as a challenge to the research community. We hope that our benchmark will spur more work on how to develop truly reliable regression uncertainty estimation methods. Code is available at https://github.com/fregu856/regression_uncertainty.
    Online learning of long-range dependencies. (arXiv:2305.15947v2 [cs.LG] UPDATED)
    Online learning holds the promise of enabling efficient long-term credit assignment in recurrent neural networks. However, current algorithms fall short of offline backpropagation by either not being scalable or failing to learn long-range dependencies. Here we present a high-performance online learning algorithm that merely doubles the memory and computational requirements of a single inference pass. We achieve this by leveraging independent recurrent modules in multi-layer networks, an architectural motif that has recently been shown to be particularly powerful. Experiments on synthetic memory problems and on the challenging long-range arena benchmark suite reveal that our algorithm performs competitively, establishing a new standard for what can be achieved through online learning. This ability to learn long-range dependencies offers a new perspective on learning in the brain and opens a promising avenue in neuromorphic computing.
    ClimateSet: A Large-Scale Climate Model Dataset for Machine Learning. (arXiv:2311.03721v1 [cs.LG])
    Climate models have been key for assessing the impact of climate change and simulating future climate scenarios. The machine learning (ML) community has taken an increased interest in supporting climate scientists' efforts on various tasks such as climate model emulation, downscaling, and prediction tasks. Many of those tasks have been addressed on datasets created with single climate models. However, both the climate science and ML communities have suggested that to address those tasks at scale, we need large, consistent, and ML-ready climate model datasets. Here, we introduce ClimateSet, a dataset containing the inputs and outputs of 36 climate models from the Input4MIPs and CMIP6 archives. In addition, we provide a modular dataset pipeline for retrieving and preprocessing additional climate models and scenarios. We showcase the potential of our dataset by using it as a benchmark for ML-based climate model emulation. We gain new insights about the performance and generalization capabilities of the different ML models by analyzing their performance across different climate models. Furthermore, the dataset can be used to train an ML emulator on several climate models instead of just one. Such a "super emulator" can quickly project new climate change scenarios, complementing existing scenarios already provided to policymakers. We believe ClimateSet will create the basis needed for the ML community to tackle climate-related tasks at scale.
    Task-Driven Detection of Distribution Shifts with Statistical Guarantees for Robot Learning. (arXiv:2106.13703v6 [cs.RO] UPDATED)
    Our goal is to perform out-of-distribution (OOD) detection, i.e., to detect when a robot is operating in environments drawn from a different distribution than the ones used to train the robot. We leverage Probably Approximately Correct (PAC)-Bayes theory to train a policy with a guaranteed bound on performance on the training distribution. Our idea for OOD detection relies on the following intuition: violation of the performance bound on test environments provides evidence that the robot is operating OOD. We formalize this via statistical techniques based on p-values and concentration inequalities. The approach provides guaranteed confidence bounds on OOD detection including bounds on both the false positive and false negative rates of the detector and is task-driven and only sensitive to changes that impact the robot's performance. We demonstrate our approach in simulation and hardware for a grasping task using objects with unfamiliar shapes or poses and a drone performing vision-based obstacle avoidance in environments with wind disturbances and varied obstacle densities. Our examples demonstrate that we can perform task-driven OOD detection within just a handful of trials.  ( 3 min )
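    The intuition that a violated performance bound is evidence of OOD operation can be illustrated with a generic concentration inequality. The sketch below uses Hoeffding's inequality and assumes i.i.d. per-trial costs in [0, 1]; the function name `ood_p_value`, the threshold, and the numbers are hypothetical, and the paper's actual test statistics may differ.

```python
import math


def ood_p_value(costs, bound):
    """Upper bound on the p-value for the null hypothesis
    'expected cost on the test environments is at most `bound`'.

    Assumes costs are i.i.d. and lie in [0, 1], so Hoeffding's inequality
    gives P(mean - expectation >= t) <= exp(-2 * n * t**2).
    """
    n = len(costs)
    gap = max(sum(costs) / n - bound, 0.0)
    return math.exp(-2.0 * n * gap ** 2)


# Hypothetical numbers: a training-time bound of 0.2 on expected cost,
# and ten test trials with much higher observed costs.
costs = [0.9, 0.8, 1.0, 0.7, 0.9, 1.0, 0.8, 0.9, 1.0, 0.9]
p = ood_p_value(costs, bound=0.2)
is_ood = p < 0.05  # flag OOD at an assumed significance level of 0.05
print(p, is_ood)
```

    Because the test depends only on observed task cost, shifts that do not hurt performance leave the p-value near 1, which matches the task-driven behaviour described in the abstract.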
    Dynamic Non-monotone Submodular Maximization. (arXiv:2311.03685v1 [cs.DS])
    Maximizing submodular functions has been increasingly used in many applications of machine learning, such as data summarization, recommendation systems, and feature selection. Moreover, there has been a growing interest in both submodular maximization and dynamic algorithms. In 2020, Monemizadeh and Lattanzi, Mitrovic, Norouzi-Fard, Tarnawski, and Zadimoghaddam initiated developing dynamic algorithms for the monotone submodular maximization problem under the cardinality constraint $k$. Recently, there have been some improvements on the topic made by Banihashem, Biabani, Goudarzi, Hajiaghayi, Jabbarzade, and Monemizadeh. In 2022, Chen and Peng studied the complexity of this problem and raised an important open question: "Can we extend [fully dynamic] results (algorithm or hardness) to non-monotone submodular maximization?". We affirmatively answer their question by demonstrating a reduction from maximizing a non-monotone submodular function under the cardinality constraint $k$ to maximizing a monotone submodular function under the same constraint. Through this reduction, we obtain the first dynamic algorithms to solve the non-monotone submodular maximization problem under the cardinality constraint $k$. Our algorithms maintain an $(8+\epsilon)$-approximation of the solution and use expected amortized $O(\epsilon^{-3}k^3\log^3(n)\log(k))$ or $O(\epsilon^{-1}k^2\log^3(k))$ oracle queries per update, respectively. Furthermore, we showcase the benefits of our dynamic algorithm for video summarization and max-cut problems on several real-world data sets.  ( 2 min )
    Are Words Enough? On the semantic conditioning of affective music generation. (arXiv:2311.03624v1 [cs.MM])
    Music has been commonly recognized as a means of expressing emotions. In this sense, an intense debate emerges from the need to verbalize musical emotions. This concern seems highly relevant today, considering the exponential growth of natural language processing using deep learning models where it is possible to prompt semantic propositions to generate music automatically. This scoping review aims to analyze and discuss the possibilities of music generation conditioned by emotions. To address this topic, we propose a historical perspective that encompasses the different disciplines and methods contributing to this topic. In detail, we review two main paradigms adopted in automatic music generation: rules-based and machine-learning models. Of note are the deep learning architectures that aim to generate high-fidelity music from textual descriptions. These models raise fundamental questions about the expressivity of music, including whether emotions can be represented with words or expressed through them. We conclude that, by overcoming the limitations and ambiguity of language in expressing emotions through music, the use of deep learning with natural language has the potential to impact the creative industries by providing powerful tools to prompt and generate new musical works.
    Plug-and-Play Stability for Intracortical Brain-Computer Interfaces: A One-Year Demonstration of Seamless Brain-to-Text Communication. (arXiv:2311.03611v1 [cs.HC])
    Intracortical brain-computer interfaces (iBCIs) have shown promise for restoring rapid communication to people with neurological disorders such as amyotrophic lateral sclerosis (ALS). However, to maintain high performance over time, iBCIs typically need frequent recalibration to combat changes in the neural recordings that accrue over days. This requires iBCI users to stop using the iBCI and engage in supervised data collection, making the iBCI system hard to use. In this paper, we propose a method that enables self-recalibration of communication iBCIs without interrupting the user. Our method leverages large language models (LMs) to automatically correct errors in iBCI outputs. The self-recalibration process uses these corrected outputs ("pseudo-labels") to continually update the iBCI decoder online. Over a period of more than one year (403 days), we evaluated our Continual Online Recalibration with Pseudo-labels (CORP) framework with one clinical trial participant. CORP achieved a stable decoding accuracy of 93.84% in an online handwriting iBCI task, significantly outperforming other baseline methods. Notably, this is the longest-running iBCI stability demonstration involving a human participant. Our results provide the first evidence for long-term stabilization of a plug-and-play, high-performance communication iBCI, addressing a major barrier for the clinical translation of iBCIs.
    Towards Accelerated Model Training via Bayesian Data Selection. (arXiv:2308.10544v3 [cs.LG] UPDATED)
    Mislabeled, duplicated, or biased data in real-world scenarios can lead to prolonged training and even hinder model convergence. Traditional solutions prioritizing easy or hard samples lack the flexibility to handle such a variety simultaneously. Recent work has proposed a more reasonable data selection principle by examining the data's impact on the model's generalization loss. However, its practical adoption relies on less principled approximations and additional holdout data. This work solves these problems by leveraging a lightweight Bayesian treatment and incorporating off-the-shelf zero-shot predictors built on large-scale pre-trained models. The resulting algorithm is efficient and easy to implement. We perform extensive empirical studies on challenging benchmarks with considerable data noise and imbalance in the online batch selection scenario, and observe superior training efficiency over competitive baselines. Notably, on the challenging WebVision benchmark, our method can achieve similar predictive performance with significantly fewer training iterations than leading data selection methods.
    Innovation and Word Usage Patterns in Machine Learning. (arXiv:2311.03633v1 [cs.LG])
    In this study, we delve into the dynamic landscape of machine learning research evolution. Initially, through the utilization of Latent Dirichlet Allocation, we discern pivotal themes and fundamental concepts that have emerged within the realm of machine learning. Subsequently, we undertake a comprehensive analysis to track the evolutionary trajectories of these identified themes. To quantify the novelty and divergence of research contributions, we employ the Kullback-Leibler Divergence metric. This statistical measure serves as a proxy for "surprise", indicating the extent of differentiation between the content of academic papers and the subsequent developments in research. By amalgamating these insights, we gain the ability to ascertain the pivotal roles played by prominent researchers and the significance of specific academic venues (periodicals and conferences) within the machine learning domain.
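    As an illustration of Kullback-Leibler divergence used as a proxy for surprise, the sketch below compares two topic distributions. The three-topic vectors, the name `kl_divergence`, and the direction of the comparison (paper versus later corpus) are assumptions for illustration, not details taken from the study.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions of equal length.

    eps guards against division by zero when q has empty topics;
    terms with p_i == 0 contribute nothing and are skipped.
    """
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


# Hypothetical topic mixtures over three LDA topics.
paper_topics = [0.7, 0.2, 0.1]   # topic mix of a single paper
later_corpus = [0.1, 0.2, 0.7]   # average topic mix of later research

surprise = kl_divergence(paper_topics, later_corpus)
print(surprise)
```

    A value near zero means the paper's topic mix resembles the comparison corpus; larger values indicate a larger divergence between the two.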
    Loss Balancing for Fair Supervised Learning. (arXiv:2311.03714v1 [cs.LG])
    Supervised learning models have been used in various domains such as lending, college admission, face recognition, natural language processing, etc. However, they may inherit pre-existing biases from training data and exhibit discrimination against protected social groups. Various fairness notions have been proposed to address unfairness issues. In this work, we focus on Equalized Loss (EL), a fairness notion that requires the expected loss to be (approximately) equalized across different groups. Imposing EL on the learning process leads to a non-convex optimization problem even if the loss function is convex, and the existing fair learning algorithms cannot properly be adopted to find the fair predictor under the EL constraint. This paper introduces an algorithm that can leverage off-the-shelf convex programming tools (e.g., CVXPY) to efficiently find the global optimum of this non-convex optimization. In particular, we propose the ELminimizer algorithm, which finds the optimal fair predictor under EL by reducing the non-convex optimization to a sequence of convex optimization problems. We theoretically prove that our algorithm finds the global optimal solution under certain conditions. Then, we support our theoretical results through several empirical studies.
    The Future of Consumer Edge-AI Computing. (arXiv:2210.10514v2 [cs.LG] UPDATED)
    In the last decade, Deep Learning has rapidly infiltrated the consumer end, mainly thanks to hardware acceleration across devices. However, as we look towards the future, it is evident that isolated hardware will be insufficient. Increasingly complex AI tasks demand shared resources, cross-device collaboration, and multiple data types, all without compromising user privacy or quality of experience. To address this, we introduce a novel paradigm centered around EdgeAI-Hub devices, designed to reorganise and optimise compute resources and data access at the consumer edge. To this end, we lay a holistic foundation for the transition from on-device to Edge-AI serving systems in consumer environments, detailing their components, structure, challenges and opportunities.
    A Simple and Efficient Baseline for Data Attribution on Images. (arXiv:2311.03386v1 [cs.CV])
    Data attribution methods play a crucial role in understanding machine learning models, providing insight into which training data points are most responsible for model outputs during deployment. However, current state-of-the-art approaches require a large ensemble of as many as 300,000 models to accurately attribute model predictions. These approaches therefore come at a high computational cost, are memory intensive, and are hard to scale to large models or datasets. In this work, we focus on a minimalist baseline, utilizing the feature space of a backbone pretrained via self-supervised learning to perform data attribution. Our method is model-agnostic and scales easily to large datasets. We show results on CIFAR-10 and ImageNet, achieving strong performance that rivals or outperforms state-of-the-art approaches at a fraction of the compute or memory cost. Contrary to prior work, our results reinforce the intuition that a model's prediction on one image is most impacted by visually similar training samples. Our approach serves as a simple and efficient baseline for data attribution on images.  ( 2 min )
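    Attribution in a pretrained feature space can be sketched as a nearest-neighbour lookup. The example below assumes the feature vectors have already been extracted by some backbone and ranks training points by cosine similarity; the similarity measure, the names `cosine` and `attribute`, and the toy vectors are assumptions, since the abstract does not specify them.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def attribute(test_feat, train_feats, k=2):
    """Return indices of the k training points most similar to test_feat."""
    scores = [(cosine(test_feat, f), i) for i, f in enumerate(train_feats)]
    scores.sort(reverse=True)  # highest similarity first
    return [i for _, i in scores[:k]]


# Hypothetical 2-D features standing in for backbone embeddings.
train_feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
top = attribute([1.0, 0.05], train_feats, k=2)
print(top)
```

    No model retraining is involved, which is where the compute and memory savings over ensemble-based attribution would come from under this reading.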
    Hypothesis Network Planned Exploration for Rapid Meta-Reinforcement Learning Adaptation. (arXiv:2311.03701v1 [cs.AI])
    Meta Reinforcement Learning (Meta RL) trains agents that adapt to fast-changing environments and tasks. Current strategies often lose adaptation efficiency due to the passive nature of model exploration, causing delayed understanding of new transition dynamics. This results in particularly fast-evolving tasks being impossible to solve. We propose a novel approach, Hypothesis Network Planned Exploration (HyPE), that integrates an active and planned exploration process via the hypothesis network to optimize adaptation speed. HyPE uses a generative hypothesis network to form potential models of state transition dynamics, then eliminates incorrect models through strategically devised experiments. Evaluated on a symbolic version of the Alchemy game, HyPE outpaces baseline methods in adaptation speed and model accuracy, validating its potential in enhancing reinforcement learning adaptation in rapidly evolving settings.
    Leveraging Generative AI: Improving Software Metadata Classification with Generated Code-Comment Pairs. (arXiv:2311.03365v1 [cs.SE])
    In software development, code comments play a crucial role in enhancing code comprehension and collaboration. This research paper addresses the challenge of objectively classifying code comments as "Useful" or "Not Useful." We propose a novel solution that harnesses contextualized embeddings, particularly BERT, to automate this classification process. We address this task by incorporating generated code and comment pairs. The initial dataset comprised 9048 pairs of code and comments written in C, labeled as either Useful or Not Useful. To augment this dataset, we sourced an additional 739 lines of code-comment pairs and generated labels using a Large Language Model Architecture, specifically BERT. The primary objective was to build classification models that can effectively differentiate between useful and not useful code comments. Various machine learning algorithms were employed, including Logistic Regression, Decision Tree, K-Nearest Neighbors (KNN), Support Vector Machine (SVM), Gradient Boosting, Random Forest, and a Neural Network. Each algorithm was evaluated using precision, recall, and F1-score metrics, both with the original seed dataset and the augmented dataset. This study showcases the potential of generative AI for enhancing binary code comment quality classification models, providing valuable insights for software developers and researchers in the field of natural language processing and software engineering.  ( 3 min )
    Mobile Augmented Reality with Federated Learning in the Metaverse. (arXiv:2212.08324v2 [cs.LG] UPDATED)
    The Metaverse is deemed the next evolution of the Internet and has received much attention recently. Metaverse applications via mobile augmented reality (MAR) require rapid and accurate object detection to mix digital data with the real world. As mobile devices evolve, their computational capabilities are increasing, and thus their computational resources can be leveraged to train machine learning models. In light of the increasing concerns of user privacy and data security, federated learning (FL) has become a promising distributed learning framework for privacy-preserving analytics. In this article, FL and MAR are brought together in the Metaverse. We discuss the necessity and rationality of the combination of FL and MAR. The prospective technologies that support FL and MAR in the Metaverse are also discussed. In addition, existing challenges that prevent the fulfillment of FL and MAR in the Metaverse and several application scenarios are presented. Finally, three case studies of Metaverse FL-MAR systems are demonstrated.
    Separating and Learning Latent Confounders to Enhancing User Preferences Modeling. (arXiv:2311.03381v1 [cs.IR])
    Recommender models aim to capture user preferences from historical feedback and then predict user-specific feedback on candidate items. However, the presence of various unmeasured confounders causes deviations between the user preferences in the historical feedback and the true preferences, resulting in models not meeting their expected performance. Existing debias models either (1) are specific to solving one particular bias or (2) directly obtain auxiliary information from user historical feedback, which cannot identify whether the learned preferences are true user preferences or mixed with unmeasured confounders. Moreover, we find that the former recommender system is not only a successor to unmeasured confounders but also acts as an unmeasured confounder affecting user preference modeling, which has always been neglected in previous studies. To this end, we incorporate the effect of the former recommender system and treat it as a proxy for all unmeasured confounders. We propose a novel framework, Separating and Learning Latent Confounders For Recommendation (SLFR), which obtains the representation of unmeasured confounders to identify the counterfactual feedback by disentangling user preferences and unmeasured confounders, then guides the target model to capture the true preferences of users. Extensive experiments on five real-world datasets validate the advantages of our method.
    OpenGSL: A Comprehensive Benchmark for Graph Structure Learning. (arXiv:2306.10280v3 [cs.LG] UPDATED)
    Graph Neural Networks (GNNs) have emerged as the de facto standard for representation learning on graphs, owing to their ability to effectively integrate graph topology and node attributes. However, the inherent suboptimal nature of node connections, resulting from the complex and contingent formation process of graphs, presents significant challenges in modeling them effectively. To tackle this issue, Graph Structure Learning (GSL), a family of data-centric learning approaches, has garnered substantial attention in recent years. The core concept behind GSL is to jointly optimize the graph structure and the corresponding GNN models. Despite the proposal of numerous GSL methods, the progress in this field remains unclear due to inconsistent experimental protocols, including variations in datasets, data processing techniques, and splitting strategies. In this paper, we introduce OpenGSL, the first comprehensive benchmark for GSL, aimed at addressing this gap. OpenGSL enables a fair comparison among state-of-the-art GSL methods by evaluating them across various popular datasets using uniform data processing and splitting strategies. Through extensive experiments, we observe that existing GSL methods do not consistently outperform vanilla GNN counterparts. We also find that there is no significant correlation between the homophily of the learned structure and task performance, challenging the common belief. Moreover, we observe that the learned graph structure demonstrates a strong generalization ability across different GNN models, despite the high computational and space consumption. We hope that our open-sourced library will facilitate rapid and equitable evaluation and inspire further innovative research in this field. The code of the benchmark can be found in https://github.com/OpenGSL/OpenGSL.
    Explicit Planning Helps Language Models in Logical Reasoning. (arXiv:2303.15714v4 [cs.CL] UPDATED)
    Language models have been shown to perform remarkably well on a wide range of natural language processing tasks. In this paper, we propose LEAP, a novel system that uses language models to perform multi-step logical reasoning and incorporates explicit planning into the inference procedure. Explicit planning enables the system to make more informed reasoning decisions at each step by looking ahead into their future effects. Moreover, we propose a training strategy that safeguards the planning process from being led astray by spurious features. Our full system significantly outperforms other competing methods on multiple standard datasets. When using small T5 models as its core selection and deduction components, our system performs competitively compared to GPT-3 despite having only about 1B parameters (i.e., 175 times smaller than GPT-3). When using GPT-3.5, it significantly outperforms chain-of-thought prompting on the challenging PrOntoQA dataset. We have conducted extensive empirical studies to demonstrate that explicit planning plays a crucial role in the system's performance.
    Manifold learning: what, how, and why. (arXiv:2311.03757v1 [stat.ML])
    Manifold learning (ML), known also as non-linear dimension reduction, is a set of methods to find the low dimensional structure of data. Dimension reduction for large, high dimensional data is not merely a way to reduce the data; the new representations and descriptors obtained by ML reveal the geometric shape of high dimensional point clouds, and allow one to visualize, de-noise and interpret them. This survey presents the principles underlying ML, the representative methods, as well as their statistical foundations from a practicing statistician's perspective. It describes the trade-offs, and what theory tells us about the parameter and algorithmic choices we make in order to obtain reliable conclusions.
    Toward Reinforcement Learning-based Rectilinear Macro Placement Under Human Constraints. (arXiv:2311.03383v1 [cs.LG])
    Macro placement is a critical phase in chip design, which becomes more intricate when involving general rectilinear macros and layout areas. Furthermore, macro placement that incorporates human-like constraints, such as design hierarchy and peripheral bias, has the potential to significantly reduce the amount of additional manual labor required from designers. This study proposes a methodology that leverages an approach suggested by Google's Circuit Training (G-CT) to provide a learning-based macro placer that not only supports placing rectilinear cases, but also adheres to crucial human-like design principles. Our experimental results demonstrate the effectiveness of our framework in achieving power-performance-area (PPA) metrics and in obtaining placements of high quality, comparable to those produced with human intervention. Additionally, our methodology shows potential as a generalized model to address diverse macro shapes and layout areas.
    Learned Causal Method Prediction. (arXiv:2311.03989v1 [cs.LG])
    For a given causal question, it is important to efficiently decide which causal inference method to use for a given dataset. This is challenging because causal methods typically rely on complex and difficult-to-verify assumptions, and cross-validation is not applicable since ground truth causal quantities are unobserved. In this work, we propose CAusal Method Predictor (CAMP), a framework for predicting the best method for a given dataset. To this end, we generate datasets from a diverse set of synthetic causal models, score the candidate methods, and train a model to directly predict the highest-scoring method for that dataset. Next, by formulating a self-supervised pre-training objective centered on dataset assumptions relevant for causal inference, we significantly reduce the need for costly labeled data and enhance training efficiency. Our strategy learns to map implicit dataset properties to the best method in a data-driven manner. In our experiments, we focus on method prediction for causal discovery. CAMP outperforms selecting any individual candidate method and demonstrates promising generalization to unseen semi-synthetic and real-world benchmarks.
    Image Amodal Completion: A Survey. (arXiv:2207.02062v3 [cs.CV] UPDATED)
    Existing computer vision systems can compete with humans in understanding the visible parts of objects, but still fall far short of humans when it comes to depicting the invisible parts of partially occluded objects. Image amodal completion aims to equip computers with human-like amodal completion functions to understand an intact object despite it being partially occluded. The main purpose of this survey is to provide an intuitive understanding of the research hotspots, key technologies and future trends in the field of image amodal completion. Firstly, we present a comprehensive review of the latest literature in this emerging field, exploring three key tasks in image amodal completion, including amodal shape completion, amodal appearance completion, and order perception. Then we examine popular datasets related to image amodal completion along with their common data collection methods and evaluation metrics. Finally, we discuss real-world applications and future research directions for image amodal completion, facilitating the reader's understanding of the challenges of existing technologies and upcoming research trends.
    Neuroformer: Multimodal and Multitask Generative Pretraining for Brain Data. (arXiv:2311.00136v2 [q-bio.NC] UPDATED)
    State-of-the-art systems neuroscience experiments yield large-scale multimodal data, and these data sets require new tools for analysis. Inspired by the success of large pretrained models in vision and language domains, we reframe the analysis of large-scale, cellular-resolution neuronal spiking data into an autoregressive spatiotemporal generation problem. Neuroformer is a multimodal, multitask generative pretrained transformer (GPT) model that is specifically designed to handle the intricacies of data in systems neuroscience. It scales linearly with feature size, can process an arbitrary number of modalities, and is adaptable to downstream tasks, such as predicting behavior. We first trained Neuroformer on simulated datasets, and found that it both accurately predicted simulated neuronal circuit activity, and also intrinsically inferred the underlying neural circuit connectivity, including direction. When pretrained to decode neural responses, the model predicted the behavior of a mouse with only few-shot fine-tuning, suggesting that the model begins learning how to do so directly from the neural representations themselves, without any explicit supervision. We used an ablation study to show that joint training on neuronal responses and behavior boosted performance, highlighting the model's ability to associate behavioral and neural representations in an unsupervised manner. These findings show that Neuroformer can analyze neural datasets and their emergent properties, informing the development of models and hypotheses associated with the brain.
    Dissecting the Runtime Performance of the Training, Fine-tuning, and Inference of Large Language Models. (arXiv:2311.03687v1 [cs.PF])
    Large Language Models (LLMs) have seen great advances in both academia and industry, and their popularity has resulted in numerous open-source frameworks and techniques for accelerating LLM pre-training, fine-tuning, and inference. Training and deploying LLMs is expensive, as it requires considerable computing resources and memory; hence many efficient approaches have been developed for improving system pipelines as well as operators. However, the runtime performance can vary significantly across hardware and software stacks, which makes it difficult to choose the best configuration. In this work, we aim to benchmark the performance from both macro and micro perspectives. First, we benchmark the end-to-end performance of pre-training, fine-tuning, and serving LLMs of different sizes, i.e., 7, 13, and 70 billion parameters (7B, 13B, and 70B), on three 8-GPU platforms with and without individual optimization techniques, including ZeRO, quantization, recomputation, and FlashAttention. Then, we dive deeper to provide a detailed runtime analysis of the sub-modules, including computing and communication operators in LLMs. For end users, our benchmark and findings help better understand different optimization techniques, training and inference frameworks, and hardware platforms when choosing configurations for deploying LLMs. For researchers, our in-depth module-wise analyses uncover potential opportunities for future work to further optimize the runtime performance of LLMs.
    Mitigating Estimation Errors by Twin TD-Regularized Actor and Critic for Deep Reinforcement Learning. (arXiv:2311.03711v1 [cs.LG])
    We address the issue of estimation bias in deep reinforcement learning (DRL) by introducing solution mechanisms that include a new, twin TD-regularized actor-critic (TDR) method. It aims at reducing both over- and under-estimation errors. With TDR, and by combining good DRL improvements such as distributional learning and the long N-step surrogate stage reward (LNSS) method, we show that our new TDR-based actor-critic learning enables DRL methods to outperform their respective baselines in challenging environments in the DeepMind Control Suite. Furthermore, they elevate TD3 and SAC respectively to a level of performance comparable to that of D4PG (the current SOTA), and they also improve the performance of D4PG to a new SOTA level measured by mean reward, convergence speed, learning success rate, and learning variance.
    Convergence of Adam Under Relaxed Assumptions. (arXiv:2304.13972v3 [math.OC] UPDATED)
    In this paper, we provide a rigorous proof of convergence of the Adaptive Moment Estimate (Adam) algorithm for a wide class of optimization objectives. Despite the popularity and efficiency of the Adam algorithm in training deep neural networks, its theoretical properties are not yet fully understood, and existing convergence proofs require unrealistically strong assumptions, such as globally bounded gradients, to show the convergence to stationary points. In this paper, we show that Adam provably converges to $\epsilon$-stationary points with ${O}(\epsilon^{-4})$ gradient complexity under far more realistic conditions. The key to our analysis is a new proof of boundedness of gradients along the optimization trajectory of Adam, under a generalized smoothness assumption according to which the local smoothness (i.e., Hessian norm when it exists) is bounded by a sub-quadratic function of the gradient norm. Moreover, we propose a variance-reduced version of Adam with an accelerated gradient complexity of ${O}(\epsilon^{-3})$.
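For readers who want the update rule being analyzed in front of them, here is a minimal sketch of the standard Adam iteration on a one-dimensional objective. It follows the commonly used formulation with bias-corrected first and second moments. It is not the variance-reduced variant the abstract proposes, and the function name `adam_minimize`, the hyperparameter defaults, and the test objective are illustrative assumptions.

```python
import math


def adam_minimize(grad, x0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    """Run standard Adam on a scalar parameter and return the final iterate.

    grad -- callable returning the gradient at x (deterministic here;
            in practice this would be a stochastic gradient estimate)
    """
    x = float(x0)
    m = 0.0  # first-moment (mean) estimate of the gradient
    v = 0.0  # second-moment (uncentered variance) estimate
    for t in range(1, steps + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)  # bias correction for the zero init
        v_hat = v / (1 - beta2 ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x


# Hypothetical test objective f(x) = (x - 3)^2, whose gradient is 2 (x - 3).
x_final = adam_minimize(lambda x: 2 * (x - 3), x0=0.0)
print(abs(x_final - 3.0) < 1e-2)
```

In this setting an $\epsilon$-stationary point is simply an iterate whose gradient magnitude is at most $\epsilon$; the toy quadratic has globally bounded curvature, so it does not exercise the generalized smoothness condition discussed in the abstract.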
    Neural MMO 2.0: A Massively Multi-task Addition to Massively Multi-agent Learning. (arXiv:2311.03736v1 [cs.AI])
    Neural MMO 2.0 is a massively multi-agent environment for reinforcement learning research. The key feature of this new version is a flexible task system that allows users to define a broad range of objectives and reward signals. We challenge researchers to train agents capable of generalizing to tasks, maps, and opponents never seen during training. Neural MMO features procedurally generated maps with 128 agents in the standard setting and support for up to. Version 2.0 is a complete rewrite of its predecessor with three-fold improved performance and compatibility with CleanRL. We release the platform as free and open-source software with comprehensive documentation available at neuralmmo.github.io and an active community Discord. To spark initial research on this new platform, we are concurrently running a competition at NeurIPS 2023.
    Guaranteed Conformance of Neurosymbolic Models to Natural Constraints. (arXiv:2212.01346v8 [cs.LG] UPDATED)
    Deep neural networks have emerged as the workhorse for a large section of robotics and control applications, especially as models for dynamical systems. Such data-driven models are in turn used for designing and verifying autonomous systems. They are particularly useful in modeling medical systems where data can be leveraged to individualize treatment. In safety-critical applications, it is important that the data-driven model is conformant to established knowledge from the natural sciences. Such knowledge is often available or can often be distilled into a (possibly black-box) model. For instance, an F1 racing car should conform to Newton's laws (which are encoded within a unicycle model). In this light, we consider the following problem: given a model $M$ and a state transition dataset, we wish to best approximate the system model while being a bounded distance away from $M$. We propose a method to guarantee this conformance. Our first step is to distill the dataset into a few representative samples called memories, using the idea of a growing neural gas. Next, using these memories we partition the state space into disjoint subsets and compute bounds that should be respected by the neural network in each subset. This serves as a symbolic wrapper for guaranteed conformance. We argue theoretically that this only leads to a bounded increase in approximation error, which can be controlled by increasing the number of memories. We experimentally show that on three case studies (Car Model, Drones, and Artificial Pancreas), our constrained neurosymbolic models conform to specified models (each encoding various constraints) with order-of-magnitude improvements compared to the augmented Lagrangian and vanilla training methods. Our code can be found at: https://github.com/kaustubhsridhar/Constrained_Models
    Illumination Variation Correction Using Image Synthesis For Unsupervised Domain Adaptive Person Re-Identification. (arXiv:2301.09702v3 [eess.IV] UPDATED)
    Unsupervised domain adaptive (UDA) person re-identification (re-ID) aims to learn identity information from labeled images in source domains and apply it to unlabeled images in a target domain. One major issue with many unsupervised re-identification methods is that they do not perform well relative to large domain variations such as illumination, viewpoint, and occlusions. In this paper, we propose a Synthesis Model Bank (SMB) to deal with illumination variation in unsupervised person re-ID. The proposed SMB consists of several convolutional neural networks (CNN) for feature extraction and Mahalanobis matrices for distance metrics. They are trained using synthetic data with different illumination conditions such that their synergistic effect makes the SMB robust against illumination variation. To better quantify the illumination intensity and improve the quality of synthetic images, we introduce a new 3D virtual-human dataset for GAN-based image synthesis. From our experiments, the proposed SMB outperforms other synthesis methods on several re-ID benchmarks.
    Measuring Adversarial Datasets. (arXiv:2311.03566v1 [cs.LG])
    In the era of widespread public use of AI systems across various domains, ensuring adversarial robustness has become increasingly vital to maintain safety and prevent undesirable errors. Researchers have curated various adversarial datasets (through perturbations) for capturing model deficiencies that cannot be revealed in standard benchmark datasets. However, little is known about how these adversarial examples differ from the original data points, and there is still no methodology to measure the intended and unintended consequences of those adversarial transformations. In this research, we conducted a systematic survey of existing quantifiable metrics that describe text instances in NLP tasks, among dimensions of difficulty, diversity, and disagreement. We selected several current adversarial effect datasets and compared the distributions between the original and their adversarial counterparts. The results provide valuable insights into what makes these datasets more challenging from a metrics perspective and whether they align with underlying assumptions.
    A graph convolutional autoencoder approach to model order reduction for parametrized PDEs. (arXiv:2305.08573v2 [math.NA] UPDATED)
    The present work proposes a framework for nonlinear model order reduction based on a Graph Convolutional Autoencoder (GCA-ROM). In the reduced order modeling (ROM) context, one is interested in obtaining real-time and many-query evaluations of parametric Partial Differential Equations (PDEs). Linear techniques such as Proper Orthogonal Decomposition (POD) and Greedy algorithms have been analyzed thoroughly, but they are more suitable when dealing with linear and affine models showing a fast decay of the Kolmogorov n-width. On one hand, the autoencoder architecture represents a nonlinear generalization of the POD compression procedure, allowing one to encode the main information in a latent set of variables while extracting their main features. On the other hand, Graph Neural Networks (GNNs) constitute a natural framework for studying PDE solutions defined on unstructured meshes. Here, we develop a non-intrusive and data-driven nonlinear reduction approach, exploiting GNNs to encode the reduced manifold and enable fast evaluations of parametrized PDEs. We show the capabilities of the methodology for several models: linear/nonlinear and scalar/vector problems with fast/slow decay in the physically and geometrically parametrized setting. The main properties of our approach consist of (i) high generalizability in the low-data regime even for complex regimes, (ii) physical compliance with general unstructured grids, and (iii) exploitation of pooling and un-pooling operations to learn from scattered data.
    Curating Naturally Adversarial Datasets for Learning-Enabled Medical Cyber-Physical Systems. (arXiv:2309.00543v2 [cs.LG] UPDATED)
    Deep learning models have shown promising predictive accuracy for time-series healthcare applications. However, ensuring the robustness of these models is vital for building trustworthy AI systems. Existing research predominantly focuses on robustness to synthetic adversarial examples, crafted by adding imperceptible perturbations to clean input data. However, these synthetic adversarial examples do not accurately reflect the most challenging real-world scenarios, especially in the context of healthcare data. Consequently, robustness to synthetic adversarial examples may not necessarily translate to robustness against naturally occurring adversarial examples, which is highly desirable for trustworthy AI. We propose a method to curate datasets comprised of natural adversarial examples to evaluate model robustness. The method relies on probabilistic labels obtained from automated weakly-supervised labeling that combines noisy and cheap-to-obtain labeling heuristics. Based on these labels, our method adversarially orders the input data and uses this ordering to construct a sequence of increasingly adversarial datasets. Our evaluation on six medical case studies and three non-medical case studies demonstrates the efficacy and statistical validity of our approach to generating naturally adversarial datasets.
    Learning-Based Optimal Control with Performance Guarantees for Unknown Systems with Latent States. (arXiv:2303.17963v2 [eess.SY] UPDATED)
    As control engineering methods are applied to increasingly complex systems, data-driven approaches for system identification appear as a promising alternative to physics-based modeling. While the Bayesian approaches prevalent for safety-critical applications usually rely on the availability of state measurements, the states of a complex system are often not directly measurable. It may then be necessary to jointly estimate the dynamics and the latent state, making the quantification of uncertainties and the design of controllers with formal performance guarantees considerably more challenging. This paper proposes a novel method for the computation of an optimal input trajectory for unknown nonlinear systems with latent states based on a combination of particle Markov chain Monte Carlo methods and scenario theory. Probabilistic performance guarantees are derived for the resulting input trajectory, and an approach to validate the performance of arbitrary control laws is presented. The effectiveness of the proposed method is demonstrated in a numerical simulation.
    DynGFN: Towards Bayesian Inference of Gene Regulatory Networks with GFlowNets. (arXiv:2302.04178v3 [cs.LG] UPDATED)
    One of the grand challenges of cell biology is inferring the gene regulatory network (GRN) which describes interactions between genes and their products that control gene expression and cellular function. We can treat this as a causal discovery problem but with two non-standard challenges: (1) regulatory networks are inherently cyclic so we should not model a GRN as a directed acyclic graph (DAG), and (2) observations have significant measurement noise, so for typical sample sizes there will always be a large equivalence class of graphs that are likely given the data, and we want methods that capture this uncertainty. Existing methods either focus on challenge (1), identifying cyclic structure from dynamics, or on challenge (2) learning complex Bayesian posteriors over DAGs, but not both. In this paper we leverage the fact that it is possible to estimate the "velocity" of gene expression with RNA velocity techniques to develop an approach that addresses both challenges. Because we have access to velocity information, we can treat the Bayesian structure learning problem as a problem of sparse identification of a dynamical system, capturing cyclic feedback loops through time. Since our objective is to model uncertainty over discrete structures, we leverage Generative Flow Networks (GFlowNets) to estimate the posterior distribution over the combinatorial space of possible sparse dependencies. Our results indicate that our method learns posteriors that better encapsulate the distributions of cyclic structures compared to counterpart state-of-the-art Bayesian structure learning approaches.
    DealMVC: Dual Contrastive Calibration for Multi-view Clustering. (arXiv:2308.09000v3 [cs.CV] UPDATED)
    Benefiting from the strong view-consistent information mining capacity, multi-view contrastive clustering has attracted plenty of attention in recent years. However, we observe the following drawback, which limits the clustering performance from further improvement. The existing multi-view models mainly focus on the consistency of the same samples in different views while ignoring the circumstance of similar but different samples in cross-view scenarios. To solve this problem, we propose a novel Dual contrastive calibration network for Multi-View Clustering (DealMVC). Specifically, we first design a fusion mechanism to obtain a global cross-view feature. Then, a global contrastive calibration loss is proposed by aligning the view feature similarity graph and the high-confidence pseudo-label graph. Moreover, to utilize the diversity of multi-view information, we propose a local contrastive calibration loss to constrain the consistency of pair-wise view features. The feature structure is regularized by reliable class information, thus guaranteeing similar samples have similar features in different views. During the training procedure, the interacted cross-view feature is jointly optimized at both local and global levels. In comparison with other state-of-the-art approaches, the comprehensive experimental results obtained from eight benchmark datasets provide substantial validation of the effectiveness and superiority of our algorithm. We release the code of DealMVC at https://github.com/xihongyang1999/DealMVC on GitHub.
    MoleCLUEs: Molecular Conformers Maximally In-Distribution for Predictive Models. (arXiv:2306.11681v2 [cs.LG] UPDATED)
    Structure-based molecular ML (SBML) models can be highly sensitive to input geometries and give predictions with large variance. We present an approach to mitigate the challenge of selecting conformations for such models by generating conformers that explicitly minimize predictive uncertainty. To achieve this, we compute estimates of aleatoric and epistemic uncertainties that are differentiable w.r.t. latent posteriors. We then iteratively sample new latents in the direction of lower uncertainty by gradient descent. As we train our predictive models jointly with a conformer decoder, the new latent embeddings can be mapped to their corresponding inputs, which we call MoleCLUEs, or (molecular) counterfactual latent uncertainty explanations (Antorán et al., 2020). We assess our algorithm for the task of predicting drug properties from 3D structure with maximum confidence. We additionally analyze the structure trajectories obtained from conformer optimizations, which provide insight into the sources of uncertainty in SBML.
    On efficient algorithms for computing near-best polynomial approximations to high-dimensional, Hilbert-valued functions from limited samples. (arXiv:2203.13908v2 [math.NA] UPDATED)
    Sparse polynomial approximation has become indispensable for approximating smooth, high- or infinite-dimensional functions from limited samples. This is a key task in computational science and engineering, e.g., surrogate modelling in uncertainty quantification where the function is the solution map of a parametric or stochastic differential equation (DE). Yet, sparse polynomial approximation lacks a complete theory. On the one hand, there is a well-developed theory of best $s$-term polynomial approximation, which asserts exponential or algebraic rates of convergence for holomorphic functions. On the other, there are increasingly mature methods such as (weighted) $\ell^1$-minimization for computing such approximations. While the sample complexity of these methods has been analyzed with compressed sensing, whether they achieve best $s$-term approximation rates is not fully understood. Furthermore, these methods are not algorithms per se, as they involve exact minimizers of nonlinear optimization problems. This paper closes these gaps. Specifically, we consider the following question: are there robust, efficient algorithms for computing approximations to finite- or infinite-dimensional, holomorphic and Hilbert-valued functions from limited samples that achieve best $s$-term rates? We answer this affirmatively by introducing algorithms and theoretical guarantees that assert exponential or algebraic rates of convergence, along with robustness to sampling, algorithmic, and physical discretization errors. We tackle both scalar- and Hilbert-valued functions, this being key to parametric or stochastic DEs. Our results involve significant developments of existing techniques, including a novel restarted primal-dual iteration for solving weighted $\ell^1$-minimization problems in Hilbert spaces. Our theory is supplemented by numerical experiments demonstrating the efficacy of these algorithms.
    An Initialization Schema for Neuronal Networks on Tabular Data. (arXiv:2311.03996v1 [cs.LG])
    Many modern applications rely on heterogeneous tabular data, which remains challenging for regression and classification. Many approaches have been proposed to adapt neural networks to this setting, but boosting and bagging of decision trees are still the best-performing methods. In this paper, we show that a binomial initialized neural network can be used effectively on tabular data. The proposed approach is a simple but effective way of initializing the first hidden layer in neural networks. We also show that this initialization schema can be used to jointly train ensembles by adding gradient masking to batch entries and using the binomial initialization for the last layer in a neural network. For this purpose, we modified the hinge binary loss and the softmax loss to make them applicable for joint ensemble training. We evaluate our approach on multiple public datasets and showcase the improved performance compared to other neural network-based approaches. In addition, we discuss the limitations of our approach and possible further research for improving the applicability of neural networks to tabular data. Link: https://es-cloud.cs.uni-tuebingen.de/d/8e2ab8c3fdd444e1a135/?p=%2FInitializationNeuronalNetworksTabularData&mode=list
    FLORA: Fine-grained Low-Rank Architecture Search for Vision Transformer. (arXiv:2311.03912v1 [cs.CV])
    Vision Transformers (ViT) have recently demonstrated success across a myriad of computer vision tasks. However, their elevated computational demands pose significant challenges for real-world deployment. While low-rank approximation stands out as a renowned method to reduce computational loads, efficiently automating the target rank selection in ViT remains a challenge. Drawing from the notable similarity and alignment between the processes of rank selection and One-Shot NAS, we introduce FLORA, an end-to-end automatic framework based on NAS. To overcome the supernet design challenge posed by the vast search space, FLORA employs a low-rank aware candidate filtering strategy. This method identifies and eliminates underperforming candidates, effectively alleviating potential undertraining and interference among subnetworks. To further enhance the quality of low-rank supernets, we design a low-rank specific training paradigm. First, we propose weight inheritance to construct the supernet and enable gradient sharing among low-rank modules. Second, we adopt low-rank aware sampling to strategically allocate training resources, taking into account inherited information from pre-trained models. Empirical results underscore FLORA's efficacy. With our method, a more fine-grained rank configuration can be generated automatically and yield up to 33% extra FLOPs reduction compared to a simple uniform configuration. More specifically, FLORA-DeiT-B/FLORA-Swin-B can save up to 55%/42% FLOPs almost without performance degradation. Importantly, FLORA boasts both versatility and orthogonality, offering an extra 21%-26% FLOPs reduction when integrated with leading compression techniques or compact hybrid structures. Our code is publicly available at https://github.com/shadowpa0327/FLORA.
    Its All Graph To Me: Foundational Topology Models with Contrastive Learning on Multiple Domains. (arXiv:2311.03976v1 [cs.LG])
    Representations and embeddings of graph data have been essential in many domains of research. The principal benefit of learning such representations is that the pre-trained model can be fine-tuned on smaller datasets where data or labels are scarce. Existing models, however, are domain specific; for example, a model trained on molecular graphs is fine-tuned on other molecular graphs. This means that in many application cases the choice of pre-trained model can be arbitrary, and novel domains may lack an appropriate pre-trained model. This is a particular issue where data is scarce, precluding traditional supervised methods. In this work we use adversarial contrastive learning to present a \method, a model pre-trained on many graph domains. We train the model only on topologies but include node labels in evaluation. We evaluate the efficacy of its learnt representations on various downstream tasks. Against baseline models pre-trained on single domains, as well as un-trained models and non-transferred models, we show that performance is equal or better using our single model. This includes when node labels are used in evaluation, where performance is consistently superior to single-domain or non-pre-trained models.
    The NCI Imaging Data Commons as a platform for reproducible research in computational pathology. (arXiv:2303.09354v3 [cs.CV] UPDATED)
    Background and Objectives: Reproducibility is a major challenge in developing machine learning (ML)-based solutions in computational pathology (CompPath). The NCI Imaging Data Commons (IDC) provides >120 cancer image collections according to the FAIR principles and is designed to be used with cloud ML services. Here, we explore its potential to facilitate reproducibility in CompPath research. Methods: Using the IDC, we implemented two experiments in which a representative ML-based method for classifying lung tumor tissue was trained and/or evaluated on different datasets. To assess reproducibility, the experiments were run multiple times with separate but identically configured instances of common ML services. Results: The AUC values of different runs of the same experiment were generally consistent. However, we observed small variations in AUC values of up to 0.045, indicating a practical limit to reproducibility. Conclusions: We conclude that the IDC facilitates approaching the reproducibility limit of CompPath research (i) by enabling researchers to reuse exactly the same datasets and (ii) by integrating with cloud ML services so that experiments can be run in identically configured computing environments.
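The run-to-run comparison described above rests on two small computations: an AUC per run and the spread of AUC values across runs. The sketch below illustrates both under stated assumptions. It uses the pairwise (Mann-Whitney) definition of AUC, the helper names `auc` and `auc_spread` are hypothetical, and the scores are made-up toy numbers rather than results from the paper.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Counts the fraction of (positive, negative) pairs in which the positive
    example receives the higher score; ties count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def auc_spread(labels, runs):
    """Largest difference in AUC between any two runs on the same labels."""
    values = [auc(labels, scores) for scores in runs]
    return max(values) - min(values)


# Toy data: the same four labelled examples scored by two hypothetical runs.
toy_labels = [0, 0, 1, 1]
run_a = [0.1, 0.4, 0.35, 0.8]   # one positive is ranked below a negative
run_b = [0.1, 0.3, 0.35, 0.8]   # all positives ranked above all negatives
print(auc(toy_labels, run_a))                 # 0.75
print(auc_spread(toy_labels, [run_a, run_b])) # 0.25
```

The quadratic pairwise loop is chosen for clarity; a rank-based implementation would be preferable for large evaluation sets.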
    Optimal Transport for Change Detection on LiDAR Point Clouds. (arXiv:2302.07025v4 [cs.CV] UPDATED)
    Unsupervised change detection between airborne LiDAR data points, taken at separate times over the same location, can be difficult due to unmatching spatial support and noise from the acquisition system. Most current approaches to detect changes in point clouds rely heavily on the computation of Digital Elevation Models (DEM) images and supervised methods. Obtaining a DEM leads to LiDAR informational loss due to pixelisation, and supervision requires large amounts of labelled data often unavailable in real-world scenarios. We propose an unsupervised approach based on the computation of the transport of 3D LiDAR points over two temporal supports. The method is based on unbalanced optimal transport and can be generalised to any change detection problem with LiDAR data. We apply our approach to publicly available datasets for monitoring urban sprawling in various noise and resolution configurations that mimic several sensors used in practice. Our method allows for unsupervised multi-class classification and outperforms the previous state-of-the-art unsupervised approaches by a significant margin.
    Climate-Invariant Machine Learning. (arXiv:2112.08440v3 [cs.LG] UPDATED)
    Projecting climate change is a generalization problem: we extrapolate the recent past using physical models across past, present, and future climates. Current climate models require representations of processes that occur at scales smaller than model grid size, which have been the main source of model projection uncertainty. Recent machine learning (ML) algorithms hold promise to improve such process representations, but tend to extrapolate poorly to climate regimes they were not trained on. To get the best of the physical and statistical worlds, we propose a new framework -- termed "climate-invariant" ML -- incorporating knowledge of climate processes into ML algorithms, and show that it can maintain high offline accuracy across a wide range of climate conditions and configurations in three distinct atmospheric models. Our results suggest that explicitly incorporating physical knowledge into data-driven models of Earth system processes can improve their consistency, data efficiency, and generalizability across climate regimes.
    Local Convergence of Gradient Methods for Min-Max Games: Partial Curvature Generically Suffices. (arXiv:2305.17275v2 [math.OC] UPDATED)
    We study the convergence to local Nash equilibria of gradient methods for two-player zero-sum differentiable games. It is well-known that such dynamics converge locally when $S \succ 0$ and may diverge when $S=0$, where $S\succeq 0$ is the symmetric part of the Jacobian at equilibrium that accounts for the "potential" component of the game. We show that these dynamics also converge as soon as $S$ is nonzero (partial curvature) and the eigenvectors of the antisymmetric part $A$ are in general position with respect to the kernel of $S$. We then study the convergence rates when $S \ll A$ and prove that they typically depend on the average of the eigenvalues of $S$, instead of the minimum as an analogy with minimization problems would suggest. To illustrate our results, we consider the problem of computing mixed Nash equilibria of continuous games. We show that, thanks to partial curvature, conic particle methods -- which optimize over both weights and supports of the mixed strategies -- generically converge faster than fixed-support methods. For min-max games, it is thus beneficial to add degrees of freedom "with curvature": this can be interpreted as yet another benefit of over-parameterization.
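    The role of partial curvature can be illustrated on a toy game. The sketch below is not taken from the paper: it assumes the two-variable game f(x, y) = (a/2) x^2 + x*y (x minimizes, y maximizes), for which the symmetric part of the Jacobian is S = diag(a, 0) and the antisymmetric part A is a rotation. The function name `simulate_gda`, the step size, and the starting point are illustrative choices.

```python
import math


def simulate_gda(a, eta=0.1, steps=200, x0=1.0, y0=1.0):
    """Simultaneous gradient descent-ascent on f(x, y) = (a/2) x^2 + x*y.

    x is the minimizing player and y the maximizing player; the unique
    equilibrium is (0, 0). Returns the final distance to that equilibrium.
    """
    x, y = x0, y0
    for _ in range(steps):
        grad_x = a * x + y  # df/dx
        grad_y = x          # df/dy
        # Both players update from the same (old) iterate.
        x, y = x - eta * grad_x, y + eta * grad_y
    return math.hypot(x, y)


# S = 0 (purely bilinear game): the iterates spiral away from (0, 0).
no_curvature = simulate_gda(a=0.0)
# S = diag(1, 0) is nonzero but singular: curvature in x alone suffices here.
partial_curvature = simulate_gda(a=1.0)
```

    In this toy case only the x player has curvature, yet the rotation induced by A mixes that curvature into the y direction, so both coordinates shrink.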
    Self-supervised Audio Teacher-Student Transformer for Both Clip-level and Frame-level Tasks. (arXiv:2306.04186v2 [eess.AS] UPDATED)
    Self-supervised learning (SSL) has emerged as a popular approach for learning audio representations. One goal of audio self-supervised pre-training is to transfer knowledge to downstream audio tasks, generally including clip-level and frame-level tasks. While frame-level tasks are important for fine-grained acoustic scene/event understanding, prior studies primarily evaluate on clip-level downstream tasks. In order to tackle both clip-level and frame-level tasks, this paper proposes Audio Teacher-Student Transformer (ATST), with a clip-level version (named ATST-Clip) and a frame-level version (named ATST-Frame), responsible for learning clip-level and frame-level representations, respectively. Both methods use a Transformer encoder and a teacher-student training scheme. We have carefully designed the view creation strategy for ATST-Clip and ATST-Frame. Specifically, ATST-Clip uses segment-wise data augmentations, and ATST-Frame integrates frame-wise data augmentations and masking. Experimental results show that our ATST-Frame model obtains state-of-the-art (SOTA) performance on most of the clip-level and frame-level downstream tasks. In particular, it outperforms other models by a large margin on the frame-level sound event detection task. In addition, the performance can be further improved by combining the two models through knowledge distillation. Our code is available online.
    PcLast: Discovering Plannable Continuous Latent States. (arXiv:2311.03534v1 [cs.LG])
    Goal-conditioned planning benefits from learned low-dimensional representations of rich, high-dimensional observations. While compact latent representations, typically learned from variational autoencoders or inverse dynamics, enable goal-conditioned planning, they ignore state affordances, thus hampering their sample-efficient planning capabilities. In this paper, we learn a representation that associates reachable states together for effective onward planning. We first learn a latent representation with multi-step inverse dynamics (to remove distracting information), and then transform this representation to associate reachable states together in $\ell_2$ space. Our proposals are rigorously tested in various simulation testbeds. Numerical results in reward-based and reward-free settings show significant improvements in sampling efficiency and yield layered state abstractions that enable computationally efficient hierarchical planning.
    Improved particle-flow event reconstruction with scalable neural networks for current and future particle detectors. (arXiv:2309.06782v3 [physics.data-an] UPDATED)
    We study scalable machine learning models for full event reconstruction in high-energy electron-positron collisions based on a highly granular detector simulation. Particle-flow reconstruction can be formulated as a supervised learning task using tracks and calorimeter clusters or hits. We compare a graph neural network and a kernel-based transformer and demonstrate that both avoid quadratic memory allocation and computational cost while achieving realistic reconstruction. We show that hyperparameter tuning on a supercomputer significantly enhances the physics performance of the models, improving the jet transverse momentum resolution by up to 50% compared to the baseline. The resulting model is highly portable across hardware processors. Finally, we demonstrate that the model can be trained on highly granular inputs consisting of tracks and calorimeter hits, resulting in physics performance competitive with the baseline. Datasets and software to reproduce the studies are published following the findable, accessible, interoperable, and reusable principles.
    Amodal Intra-class Instance Segmentation: Synthetic Datasets and Benchmark. (arXiv:2303.06596v2 [cs.CV] UPDATED)
    Images of realistic scenes often contain intra-class objects that are heavily occluded from each other, making the amodal perception task that requires parsing the occluded parts of the objects challenging. Although important for downstream tasks such as robotic grasping systems, the lack of large-scale amodal datasets with detailed annotations makes it difficult to model intra-class occlusions explicitly. This paper introduces two new amodal datasets for image amodal completion tasks, which contain a total of over 267K images of intra-class occlusion scenarios, annotated with multiple masks, amodal bounding boxes, dual order relations and full appearance for instances and background. We also present a point-supervised scheme with layer priors for amodal instance segmentation specifically designed for intra-class occlusion scenarios. Experiments show that our weakly supervised approach outperforms the SOTA fully supervised methods, while our layer priors design exhibits remarkable performance improvements in the case of intra-class occlusion in both synthetic and real images.
    AdaSub: Stochastic Optimization Using Second-Order Information in Low-Dimensional Subspaces. (arXiv:2310.20060v2 [math.OC] UPDATED)
    We introduce AdaSub, a stochastic optimization algorithm that computes a search direction based on second-order information in a low-dimensional subspace that is defined adaptively based on available current and past information. Compared to first-order methods, second-order methods exhibit better convergence characteristics, but the need to compute the Hessian matrix at each iteration results in excessive computational expenses, making them impractical. To address this issue, our approach enables the management of computational expenses and algorithm efficiency by enabling the selection of the subspace dimension for the search. Our code is freely available on GitHub, and our preliminary numerical results demonstrate that AdaSub surpasses popular stochastic optimizers in terms of time and number of iterations required to reach a given accuracy.
    FedMEKT: Distillation-based Embedding Knowledge Transfer for Multimodal Federated Learning. (arXiv:2307.13214v2 [cs.LG] UPDATED)
    Federated learning (FL) enables a decentralized machine learning paradigm for multiple clients to collaboratively train a generalized global model without sharing their private data. Most existing works simply propose typical FL systems for single-modal data, thus limiting their potential to exploit valuable multimodal data for future personalized applications. Furthermore, the majority of FL approaches still rely on labeled data at the client side, which is limited in real-world applications because users are unable to self-annotate. In light of these limitations, we propose a novel multimodal FL framework that employs a semi-supervised learning approach to leverage the representations from different modalities. Bringing this concept into a system, we develop a distillation-based multimodal embedding knowledge transfer mechanism, namely FedMEKT, which allows the server and clients to exchange the joint knowledge of their learning models extracted from a small multimodal proxy dataset. Our FedMEKT iteratively updates the generalized global encoders with the joint embedding knowledge from the participating clients. Thereby, to address the modality discrepancy and labeled data constraint in existing FL systems, our proposed FedMEKT comprises local multimodal autoencoder learning, generalized multimodal autoencoder construction, and generalized classifier learning. Through extensive experiments on three multimodal human activity recognition datasets, we demonstrate that FedMEKT achieves superior global encoder performance on linear evaluation and guarantees user privacy for personal data and model parameters while demanding less communication cost than other baselines.
    Comparing Causal Frameworks: Potential Outcomes, Structural Models, Graphs, and Abstractions. (arXiv:2306.14351v2 [stat.ME] UPDATED)
    The aim of this paper is to make clear and precise the relationship between the Rubin causal model (RCM) and structural causal model (SCM) frameworks for causal inference. Adopting a neutral logical perspective, and drawing on previous work, we show what is required for an RCM to be representable by an SCM. A key result then shows that every RCM -- including those that violate algebraic principles implied by the SCM framework -- emerges as an abstraction of some representable RCM. Finally, we illustrate the power of this conciliatory perspective by pinpointing an important role for SCM principles in classic applications of RCMs; conversely, we offer a characterization of the algebraic constraints implied by a graph, helping to substantiate further comparisons between the two frameworks.
    A Corrected Expected Improvement Acquisition Function Under Noisy Observations. (arXiv:2310.05166v2 [cs.LG] UPDATED)
    Sequential maximization of expected improvement (EI) is one of the most widely used policies in Bayesian optimization because of its simplicity and ability to handle noisy observations. In particular, the improvement function often uses the best posterior mean as the best incumbent in noisy settings. However, the uncertainty associated with the incumbent solution is often neglected in many analytic EI-type methods: a closed-form acquisition function is derived in the noise-free setting, but then applied to the setting with noisy observations. To address this limitation, we propose a modification of EI that corrects its closed-form expression by incorporating the covariance information provided by the Gaussian Process (GP) model. This acquisition function specializes to the classical noise-free result, and we argue should replace that formula in Bayesian optimization software packages, tutorials, and textbooks. This enhanced acquisition provides good generality for noisy and noiseless settings. We show that our method achieves a sublinear convergence rate on the cumulative regret bound under heteroscedastic observation noise. Our empirical results demonstrate that our proposed acquisition function can outperform EI in the presence of noisy observations on benchmark functions for black-box optimization, as well as on parameter search for neural network model compression.
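    For reference, the classical noise-free closed form that the abstract says is often applied unchanged to noisy settings can be sketched as follows. This is the standard textbook expression for maximization, not the corrected acquisition function proposed in the paper; the function name `expected_improvement` and its argument names are illustrative.

```python
import math


def expected_improvement(mu, sigma, best):
    """Classical noise-free EI (maximization) at a single candidate point.

    mu, sigma: GP posterior mean and standard deviation at the candidate.
    best: incumbent value, treated here as a known constant. That treatment
          is exactly the simplification the abstract argues breaks down
          under noisy observations.
    """
    if sigma <= 0.0:
        # Degenerate posterior: the improvement is deterministic.
        return max(mu - best, 0.0)
    z = (mu - best) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)  # standard normal density
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))         # standard normal CDF
    return (mu - best) * cdf + sigma * pdf
```

    A candidate whose mean equals the incumbent still has positive EI whenever sigma > 0, which is what drives exploration under this policy.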
    A Graph-Theoretic Framework for Understanding Open-World Semi-Supervised Learning. (arXiv:2311.03524v1 [cs.LG])
    Open-world semi-supervised learning aims at inferring both known and novel classes in unlabeled data, by harnessing prior knowledge from a labeled set with known classes. Despite its importance, there is a lack of theoretical foundations for this problem. This paper bridges the gap by formalizing a graph-theoretic framework tailored for the open-world setting, where the clustering can be theoretically characterized by graph factorization. Our graph-theoretic framework illuminates practical algorithms and provides guarantees. In particular, based on our graph formulation, we apply the algorithm called Spectral Open-world Representation Learning (SORL), and show that minimizing our loss is equivalent to performing spectral decomposition on the graph. Such equivalence allows us to derive a provable error bound on the clustering performance for both known and novel classes, and analyze rigorously when labeled data helps. Empirically, SORL can match or outperform several strong baselines on common benchmark datasets, which is appealing for practical usage while enjoying theoretical guarantees.
    Simple and Controllable Music Generation. (arXiv:2306.05284v2 [cs.SD] UPDATED)
    We tackle the task of conditional music generation. We introduce MusicGen, a single Language Model (LM) that operates over several streams of compressed discrete music representation, i.e., tokens. Unlike prior work, MusicGen is comprised of a single-stage transformer LM together with efficient token interleaving patterns, which eliminates the need for cascading several models, e.g., hierarchically or upsampling. Following this approach, we demonstrate how MusicGen can generate high-quality samples, both mono and stereo, while being conditioned on textual description or melodic features, allowing better control over the generated output. We conduct extensive empirical evaluation, considering both automatic and human studies, showing the proposed approach is superior to the evaluated baselines on a standard text-to-music benchmark. Through ablation studies, we shed light on the importance of each of the components comprising MusicGen. Music samples, code, and models are available at https://github.com/facebookresearch/audiocraft
    SE(3) Equivariant Augmented Coupling Flows. (arXiv:2308.10364v3 [cs.LG] UPDATED)
    Coupling normalizing flows allow for fast sampling and density evaluation, making them the tool of choice for probabilistic modeling of physical systems. However, the standard coupling architecture precludes endowing flows that operate on the Cartesian coordinates of atoms with the SE(3) and permutation invariances of physical systems. This work proposes a coupling flow that preserves SE(3) and permutation equivariance by performing coordinate splits along additional augmented dimensions. At each layer, the flow maps atoms' positions into learned SE(3) invariant bases, where we apply standard flow transformations, such as monotonic rational-quadratic splines, before returning to the original basis. Crucially, our flow preserves fast sampling and density evaluation, and may be used to produce unbiased estimates of expectations with respect to the target distribution via importance sampling. When trained on the DW4, LJ13, and QM9-positional datasets, our flow is competitive with equivariant continuous normalizing flows, while allowing sampling more than an order of magnitude faster. Moreover, to the best of our knowledge, we are the first to learn the full Boltzmann distribution of alanine dipeptide by only modeling the Cartesian positions of its atoms. Lastly, we demonstrate that our flow can be trained to approximately sample from the Boltzmann distribution of the DW4 and LJ13 particle systems using only their energy functions.
    Hebbian learning inspired estimation of the linear regression parameters from queries. (arXiv:2311.03483v1 [math.ST])
    Local learning rules in biological neural networks (BNNs) are commonly referred to as Hebbian learning. [26] links a biologically motivated Hebbian learning rule to a specific zeroth-order optimization method. In this work, we study a variation of this Hebbian learning rule to recover the regression vector in the linear regression model. Zeroth-order optimization methods are known to converge with suboptimal rate for large parameter dimension compared to first-order methods like gradient descent, and are therefore thought to be in general inferior. By establishing upper and lower bounds, we show, however, that such methods achieve near-optimal rates if only queries of the linear regression loss are available. Moreover, we prove that this Hebbian learning rule can achieve considerably faster rates than any non-adaptive method that selects the queries independently of the data.
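    To make "queries of the linear regression loss" concrete, here is a minimal sketch of a generic zeroth-order scheme. It is not the Hebbian rule analyzed in the paper: it assumes a standard two-point random-direction estimator, noise-free responses, and Gaussian covariates, and every name and constant in it (`zeroth_order_regression`, `lr`, `delta`) is a hypothetical choice for illustration.

```python
import random


def zeroth_order_regression(theta_star, steps=3000, lr=0.01, delta=1e-3, seed=0):
    """Recover a regression vector using only evaluations of the squared loss.

    At each step a fresh sample (x, y) arrives, the loss is queried at two
    perturbed parameter vectors, and the finite-difference slope along a
    random direction u replaces the gradient. No gradient is ever computed.
    """
    rng = random.Random(seed)
    d = len(theta_star)
    theta = [0.0] * d

    def loss(params, x, y):
        pred = sum(p * xi for p, xi in zip(params, x))
        return (y - pred) ** 2

    for _ in range(steps):
        x = [rng.gauss(0.0, 1.0) for _ in range(d)]
        y = sum(t * xi for t, xi in zip(theta_star, x))  # noise-free response (assumption)
        u = [rng.gauss(0.0, 1.0) for _ in range(d)]      # random query direction
        plus = [t + delta * ui for t, ui in zip(theta, u)]
        minus = [t - delta * ui for t, ui in zip(theta, u)]
        # Two loss queries give the directional derivative along u.
        slope = (loss(plus, x, y) - loss(minus, x, y)) / (2.0 * delta)
        theta = [t - lr * slope * ui for t, ui in zip(theta, u)]
    return theta


estimate = zeroth_order_regression([1.0, -2.0, 0.5])
```

    The step size has to shrink as the dimension grows, since the variance of the random-direction estimate grows with it; this is one intuition for the dimension-dependent rates mentioned in the abstract.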
    Spatio-Temporal Similarity Measure based Multi-Task Learning for Predicting Alzheimer's Disease Progression using MRI Data. (arXiv:2311.03557v1 [cs.LG])
    Identifying and utilising various biomarkers for tracking Alzheimer's disease (AD) progression has received much recent attention and can help clinicians make prompt decisions. Traditional progression models focus on extracting morphological biomarkers in regions of interest (ROIs) from MRI/PET images, such as regional average cortical thickness and regional volume. They are effective but ignore the relationships between brain ROIs over time, which would lead to synergistic deterioration. To explore the synergistic deteriorating relationship between these biomarkers, in this paper we propose a novel spatio-temporal similarity measure based multi-task learning approach for effectively predicting AD progression and sensitively capturing the critical relationships between biomarkers. Specifically, we first define a temporal measure for estimating the magnitude and velocity of biomarker change over time, which indicate a changing trend (temporal). Converting this trend into a vector, we then compare this variability between biomarkers in a unified vector space (spatial). The experimental results show that, compared with direct ROI-based learning, our proposed method is more effective in predicting disease progression. Our method also enables performing longitudinal stability selection to identify the changing relationships between biomarkers, which play a key role in disease progression. We prove that the synergistic deteriorating biomarkers between cortical volumes or surface areas have a significant effect on the cognitive prediction.
    Counterfactual Data Augmentation with Contrastive Learning. (arXiv:2311.03630v1 [cs.LG])
    Statistical disparity between distinct treatment groups is one of the most significant challenges for estimating Conditional Average Treatment Effects (CATE). To address this, we introduce a model-agnostic data augmentation method that imputes the counterfactual outcomes for a selected subset of individuals. Specifically, we utilize contrastive learning to learn a representation space and a similarity measure such that in the learned representation space close individuals identified by the learned similarity measure have similar potential outcomes. This property ensures reliable imputation of counterfactual outcomes for the individuals with close neighbors from the alternative treatment group. By augmenting the original dataset with these reliable imputations, we can effectively reduce the discrepancy between different treatment groups, while inducing minimal imputation error. The augmented dataset is subsequently employed to train CATE estimation models. Theoretical analysis and experimental studies on synthetic and semi-synthetic benchmarks demonstrate that our method achieves significant improvements in both performance and robustness to overfitting across state-of-the-art models.
    Understanding, Predicting and Better Resolving Q-Value Divergence in Offline-RL. (arXiv:2310.04411v2 [cs.LG] UPDATED)
    The divergence of the Q-value estimation has been a prominent issue in offline RL, where the agent has no access to real dynamics. Traditional beliefs attribute this instability to querying out-of-distribution actions when bootstrapping value targets. Though this issue can be alleviated with policy constraints or conservative Q estimation, a theoretical understanding of the underlying mechanism causing the divergence has been absent. In this work, we aim to thoroughly comprehend this mechanism and attain an improved solution. We first identify a fundamental pattern, self-excitation, as the primary cause of Q-value estimation divergence in offline RL. Then, we propose a novel Self-Excite Eigenvalue Measure (SEEM) metric based on Neural Tangent Kernel (NTK) to measure the evolving property of Q-network at training, which provides an intriguing explanation of the emergence of divergence. For the first time, our theory can reliably decide whether the training will diverge at an early stage, and even predict the order of the growth for the estimated Q-value, the model's norm, and the crashing step when an SGD optimizer is used. The experiments demonstrate perfect alignment with this theoretical analysis. Building on our insights, we propose to resolve divergence from a novel perspective, namely improving the model's architecture for better extrapolating behavior. Through extensive empirical studies, we identify LayerNorm as a good solution to effectively avoid divergence without introducing detrimental bias, leading to superior performance. Experimental results prove that it can still work in some of the most challenging settings, i.e. using only 1 transitions of the dataset, where all previous methods fail. Moreover, it can be easily plugged into modern offline RL methods and achieve SOTA results on many challenging tasks. We also give unique insights into its effectiveness.
    Differentially Private Pre-Trained Model Fusion using Decentralized Federated Graph Matching. (arXiv:2311.03396v1 [cs.LG])
    Model fusion is becoming a crucial component in the context of model-as-a-service scenarios, enabling the delivery of high-quality model services to local users. However, this approach introduces privacy risks and imposes certain limitations on its applications. Ensuring secure model exchange and knowledge fusion among users becomes a significant challenge in this setting. To tackle this issue, we propose PrivFusion, a novel architecture that preserves privacy while facilitating model fusion under the constraints of local differential privacy. PrivFusion leverages a graph-based structure, enabling the fusion of models from multiple parties without necessitating retraining. By employing randomized mechanisms, PrivFusion ensures privacy guarantees throughout the fusion process. To enhance model privacy, our approach incorporates a hybrid local differentially private mechanism and decentralized federated graph matching, effectively protecting both activation values and weights. Additionally, we introduce a perturbation filter adapter to alleviate the impact of randomized noise, thereby preserving the utility of the fused model. Through extensive experiments conducted on diverse image datasets and real-world healthcare applications, we provide empirical evidence showcasing the effectiveness of PrivFusion in maintaining model performance while preserving privacy. Our contributions offer valuable insights and practical solutions for secure and collaborative data analysis within the domain of privacy-preserving model fusion.
    Formulating Discrete Probability Flow Through Optimal Transport. (arXiv:2311.03886v1 [cs.LG])
    Continuous diffusion models are commonly acknowledged to display a deterministic probability flow, whereas discrete diffusion models do not. In this paper, we aim to establish the fundamental theory for the probability flow of discrete diffusion models. Specifically, we first prove that the continuous probability flow is the Monge optimal transport map under certain conditions, and also present equivalent evidence for discrete cases. In view of these findings, we are then able to define the discrete probability flow in line with the principles of optimal transport. Finally, drawing upon our newly established definitions, we propose a novel sampling method that surpasses previous discrete diffusion models in its ability to generate more certain outcomes. Extensive experiments on the synthetic toy dataset and the CIFAR-10 dataset have validated the effectiveness of our proposed discrete probability flow. Code is released at: https://github.com/PangzeCheung/Discrete-Probability-Flow.
    Learning Decentralized Traffic Signal Controllers with Multi-Agent Graph Reinforcement Learning. (arXiv:2311.03756v1 [cs.LG])
    This paper considers optimal traffic signal control in smart cities, which has been taken as a complex networked system control problem. Given the interacting dynamics among traffic lights and road networks, attaining controller adaptivity and scalability stands out as a primary challenge. Capturing the spatial-temporal correlation among traffic lights under the framework of Multi-Agent Reinforcement Learning (MARL) is a promising solution. Nevertheless, existing MARL algorithms ignore effective information aggregation which is fundamental for improving the learning capacity of decentralized agents. In this paper, we design a new decentralized control architecture with improved environmental observability to capture the spatial-temporal correlation. Specifically, we first develop a topology-aware information aggregation strategy to extract correlation-related information from unstructured data gathered in the road network. Particularly, we transfer the road network topology into a graph shift operator by forming a diffusion process on the topology, which subsequently facilitates the construction of graph signals. A diffusion convolution module is developed, forming a new MARL algorithm, which endows agents with the capabilities of graph learning. Extensive experiments based on both synthetic and real-world datasets verify that our proposal outperforms existing decentralized algorithms.
    Latent Diffusion for Language Generation. (arXiv:2212.09462v2 [cs.CL] UPDATED)
    Diffusion models have achieved great success in modeling continuous data modalities such as images, audio, and video, but have seen limited use in discrete domains such as language. Recent attempts to adapt diffusion to language have presented diffusion as an alternative to existing pretrained language models. We view diffusion and existing language models as complementary. We demonstrate that encoder-decoder language models can be utilized to efficiently learn high-quality language autoencoders. We then demonstrate that continuous diffusion models can be learned in the latent space of the language autoencoder, enabling us to sample continuous latent representations that can be decoded into natural language with the pretrained decoder. We validate the effectiveness of our approach for unconditional, class-conditional, and sequence-to-sequence language generation. We demonstrate across multiple diverse data sets that our latent language diffusion models are significantly more effective than previous diffusion language models.
    Unscrambling the Rectification of Adversarial Attacks Transferability across Computer Networks. (arXiv:2311.03373v1 [cs.CR])
    Convolutional neural network (CNN) models play a vital role in achieving state-of-the-art performance in various technological fields. CNNs are not limited to Natural Language Processing (NLP) or Computer Vision (CV) but also have substantial applications in other technological domains, particularly in cybersecurity. The reliability of CNN models can be compromised because of their susceptibility to adversarial attacks, which can be generated effortlessly, easily applied, and transferred in real-world scenarios. In this paper, we present a novel and comprehensive method to improve the strength of attacks and assess the transferability of adversarial examples in CNNs when such strength changes, as well as whether the transferability property issue exists in computer network applications. In the context of our study, we initially examined six distinct modes of attack: the Carlini and Wagner (C&W), Fast Gradient Sign Method (FGSM), Iterative Fast Gradient Sign Method (I-FGSM), Jacobian-based Saliency Map (JSMA), Limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS), and Projected Gradient Descent (PGD) attack. We applied these attack techniques on two popular datasets: the CIC and UNSW datasets. The outcomes of our experiment demonstrate that an improvement in transferability occurs in the targeted scenarios for FGSM, JSMA, L-BFGS, and other attacks. Our findings further indicate that the threats to security posed by adversarial examples, even in computer network applications, necessitate the development of novel defense mechanisms to enhance the security of DL-based techniques.
    PowerFlowNet: Leveraging Message Passing GNNs for Improved Power Flow Approximation. (arXiv:2311.03415v1 [cs.LG])
    Accurate and efficient power flow (PF) analysis is crucial to the efficient operation and planning of modern electrical networks. Therefore, there is a need for scalable algorithms capable of handling large-scale power networks that can provide accurate and fast solutions. Graph Neural Networks (GNNs) have emerged as a promising approach for enhancing the speed of PF approximations by leveraging their ability to capture distinctive features from the underlying power network graph. In this study, we introduce PowerFlowNet, a novel GNN architecture for PF approximation that achieves performance similar to the traditional Newton-Raphson method but is 4 times faster in the simple IEEE 14-bus system and 145 times faster in the realistic case of the French high voltage network (6470rte). Meanwhile, it significantly outperforms other traditional approximation methods, such as the DC relaxation method, in terms of performance and execution time, making PowerFlowNet a highly promising solution for real-world PF analysis. Furthermore, we verify the efficacy of our approach by conducting an in-depth experimental evaluation, thoroughly examining the performance, scalability, interpretability, and architectural dependability of PowerFlowNet. The evaluation provides insights into the behavior and potential applications of GNNs in power system analysis.
    Finding Increasingly Large Extremal Graphs with AlphaZero and Tabu Search. (arXiv:2311.03583v1 [cs.AI])
    This work studies a central extremal graph theory problem inspired by a 1975 conjecture of Erdős, which aims to find graphs with a given size (number of nodes) that maximize the number of edges without having 3- or 4-cycles. We formulate this problem as a sequential decision-making problem and compare AlphaZero, a neural network-guided tree search, with tabu search, a heuristic local search method. Using either method, by introducing a curriculum -- jump-starting the search for larger graphs using good graphs found at smaller sizes -- we improve the state-of-the-art lower bounds for several sizes. We also propose a flexible graph-generation environment and a permutation-invariant network architecture for learning to search in the space of graphs.
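    As a rough illustration of the constraint in this problem, the sketch below checks whether a simple graph contains a 3- or 4-cycle. It is not the paper's environment or search method; the function name `has_short_cycle` and the edge-list representation are assumptions made for this example. It relies on two standard facts: adjacent nodes with a common neighbour form a triangle, and any two nodes with two common neighbours form a 4-cycle.

```python
from itertools import combinations


def has_short_cycle(n, edges):
    """Return True if the simple graph on nodes 0..n-1 has a 3- or 4-cycle.

    `edges` is an iterable of (u, v) pairs (hypothetical representation).
    """
    adj = [set() for _ in range(n)]
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    for u, v in combinations(range(n), 2):
        common = len(adj[u] & adj[v])
        # Adjacent nodes sharing a neighbour close a triangle.
        if v in adj[u] and common >= 1:
            return True
        # Two nodes sharing two neighbours close a 4-cycle.
        if common >= 2:
            return True
    return False


# A 5-cycle is a feasible graph on 5 nodes with 5 edges.
five_cycle = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
print(has_short_cycle(5, five_cycle))
```

    A search method of either kind would score a candidate by its edge count and reject or penalize it whenever a check like this returns True.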
    Stable Modular Control via Contraction Theory for Reinforcement Learning. (arXiv:2311.03669v1 [cs.LG])
    We propose a novel way to integrate control techniques with reinforcement learning (RL) for stability, robustness, and generalization: leveraging contraction theory to realize modularity in neural control, which ensures that combining stable subsystems can automatically preserve the stability. We realize such modularity via signal composition and dynamic decomposition. Signal composition creates the latent space, within which RL applies to maximizing rewards. Dynamic decomposition is realized by coordinate transformation that creates an auxiliary space, within which the latent signals are coupled in the way that their combination can preserve stability provided each signal, that is, each subsystem, has stable self-feedbacks. Leveraging modularity, the nonlinear stability problem is deconstructed into algebraically solvable ones, the stability of the subsystems in the auxiliary space, yielding linear constraints on the input gradients of control networks that can be as simple as switching the signs of network weights. This minimally invasive method for stability allows arguably easy integration into the modular neural architectures in machine learning, like hierarchical RL, and improves their performance. We demonstrate in simulation the necessity and the effectiveness of our method: the necessity for robustness and generalization, and the effectiveness in improving hierarchical RL for manipulation learning.
    Text Augmentations with R-drop for Classification of Tweets Self Reporting Covid-19. (arXiv:2311.03420v1 [cs.CL])
    This paper presents models created for the Social Media Mining for Health 2023 shared task. Our team addressed the first task, classifying tweets that self-report Covid-19 diagnosis. Our approach involves a classification model that incorporates diverse textual augmentations and utilizes R-drop to augment data and mitigate overfitting, boosting model efficacy. Our leading model, enhanced with R-drop and augmentations like synonym substitution, reserved words, and back translations, outperforms the task mean and median scores. Our system achieves an impressive F1 score of 0.877 on the test set.
    Graph Neural Networks for Power Grid Operational Risk Assessment. (arXiv:2311.03661v1 [eess.SY])
    In this article, the utility of graph neural network (GNN) surrogates for Monte Carlo (MC) sampling-based risk quantification in daily operations of power grid is investigated. The MC simulation process necessitates solving a large number of optimal power flow (OPF) problems corresponding to the sample values of stochastic grid variables (power demand and renewable generation), which is computationally prohibitive. Computationally inexpensive surrogates of the OPF problem provide an attractive alternative for expedited MC simulation. GNN surrogates are especially suitable due to their superior ability to handle graph-structured data. Therefore, GNN surrogates of OPF problem are trained using supervised learning. They are then used to obtain Monte Carlo (MC) samples of the quantities of interest (operating reserve, transmission line flow) given the (hours-ahead) probabilistic wind generation and load forecast. The utility of GNN surrogates is evaluated by comparing OPF-based and GNN-based grid reliability and risk for IEEE Case118 synthetic grid. It is shown that the GNN surrogates are sufficiently accurate for predicting the (bus-level, branch-level and system-level) grid state and enable fast as well as accurate operational risk quantification for power grids. The article thus develops various tools for fast reliability and risk quantification for real-world power grids using GNNs.
    Hopfield-Enhanced Deep Neural Networks for Artifact-Resilient Brain State Decoding. (arXiv:2311.03421v1 [q-bio.NC])
    The study of brain states, ranging from highly synchronous to asynchronous neuronal patterns like the sleep-wake cycle, is fundamental for assessing the brain's spatiotemporal dynamics and their close connection to behavior. However, the development of new techniques to accurately identify them still remains a challenge, as these are often compromised by the presence of noise, artifacts, and suboptimal recording quality. In this study, we propose a two-stage computational framework combining Hopfield Networks for artifact data preprocessing with Convolutional Neural Networks (CNNs) for classification of brain states in rat neural recordings under different levels of anesthesia. To evaluate the robustness of our framework, we deliberately introduced noise artifacts into the neural recordings. We evaluated our hybrid Hopfield-CNN pipeline by benchmarking it against two comparative models: a standalone CNN handling the same noisy inputs, and another CNN trained and tested on artifact-free data. Performance across various levels of data compression and noise intensities showed that our framework can effectively mitigate artifacts, allowing the model to reach parity with the clean-data CNN at lower noise levels. Although this study mainly benefits small-scale experiments, the findings highlight the necessity for advanced deep learning and Hopfield Network models to improve scalability and robustness in diverse real-world settings.
    Loss Dynamics of Temporal Difference Reinforcement Learning. (arXiv:2307.04841v2 [stat.ML] UPDATED)
    Reinforcement learning has been successful across several applications in which agents have to learn to act in environments with sparse feedback. However, despite this empirical success, there is still a lack of theoretical understanding of how the parameters of reinforcement learning models and the features used to represent states interact to control the dynamics of learning. In this work, we use concepts from statistical physics to study the typical-case learning curves for temporal difference learning of a value function with linear function approximators. Our theory is derived under a Gaussian equivalence hypothesis where averages over the random trajectories are replaced with temporally correlated Gaussian feature averages, and we validate our assumptions on small-scale Markov Decision Processes. We find that the stochastic semi-gradient noise due to subsampling the space of possible episodes leads to significant plateaus in the value error, unlike in traditional gradient descent dynamics. We study how learning dynamics and plateaus depend on feature structure, learning rate, discount factor, and reward function. We then analyze how strategies like learning rate annealing and reward shaping can favorably alter learning dynamics and plateaus. To conclude, our work introduces new tools to open a new direction towards developing a theory of learning dynamics in reinforcement learning.  ( 2 min )
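    For readers unfamiliar with the setting, here is a minimal sketch of one semi-gradient TD(0) step with a linear value function, the textbook form of the algorithm the abstract analyzes. The function name `td0_update`, the default step size, and the feature vectors are assumptions for this example and are not taken from the paper.

```python
def td0_update(w, phi_s, phi_next, reward, lr=0.1, gamma=0.9):
    """One semi-gradient TD(0) step for the linear value V(s) = w . phi(s).

    Returns the updated weights and the TD error. Only V(s) is
    differentiated; the bootstrapped target is treated as a constant,
    which is why the update is called "semi-gradient".
    """
    v_s = sum(wi * xi for wi, xi in zip(w, phi_s))
    v_next = sum(wi * xi for wi, xi in zip(w, phi_next))
    delta = reward + gamma * v_next - v_s
    new_w = [wi + lr * delta * xi for wi, xi in zip(w, phi_s)]
    return new_w, delta


# Hypothetical two-feature example with zero initial weights.
w, delta = td0_update([0.0, 0.0], [1.0, 0.0], [0.0, 1.0], reward=1.0)
print(w, delta)
```

    The learning rate `lr`, discount factor `gamma`, and the feature vectors are the quantities whose effect on plateaus the abstract says it studies.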
    Size Matters: Large Graph Generation with HiGGs. (arXiv:2306.11412v2 [cs.LG] UPDATED)
    Large graphs are present in a variety of domains, including social networks, civil infrastructure, and the physical sciences to name a few. Graph generation is similarly widespread, with applications in drug discovery, network analysis and synthetic datasets among others. While GNN (Graph Neural Network) models have been applied in these domains their high in-memory costs restrict them to small graphs. Conversely less costly rule-based methods struggle to reproduce complex structures. We propose HIGGS (Hierarchical Generation of Graphs) as a model-agnostic framework of producing large graphs with realistic local structures. HIGGS uses GNN models with conditional generation capabilities to sample graphs in hierarchies of resolution. As a result HIGGS has the capacity to extend the scale of generated graphs from a given GNN model by quadratic order. As a demonstration we implement HIGGS using DiGress, a recent graph-diffusion model, including a novel edge-predictive-diffusion variant edge-DiGress. We use this implementation to generate categorically attributed graphs with tens of thousands of nodes. These HIGGS generated graphs are far larger than any previously produced using GNNs. Despite this jump in scale we demonstrate that the graphs produced by HIGGS are, on the local scale, more realistic than those from the rule-based model BTER.  ( 2 min )
    k-Means Maximum Entropy Exploration. (arXiv:2205.15623v4 [cs.LG] UPDATED)
    Exploration in high-dimensional, continuous spaces with sparse rewards is an open problem in reinforcement learning. Artificial curiosity algorithms address this by creating rewards that lead to exploration. Given a reinforcement learning algorithm capable of maximizing rewards, the problem reduces to finding an optimization objective consistent with exploration. Maximum entropy exploration uses the entropy of the state visitation distribution as such an objective. However, efficiently estimating the entropy of the state visitation distribution is challenging in high-dimensional, continuous spaces. We introduce an artificial curiosity algorithm based on lower bounding an approximation to the entropy of the state visitation distribution. The bound relies on a result we prove for non-parametric density estimation in arbitrary dimensions using k-means. We show that our approach is both computationally efficient and competitive on benchmarks for exploration in high-dimensional, continuous spaces, especially on tasks where reinforcement learning algorithms are unable to find rewards.
    A Logic for Expressing Log-Precision Transformers. (arXiv:2210.02671v6 [cs.LG] UPDATED)
    One way to interpret the reasoning power of transformer-based language models is to describe the types of logical rules they can resolve over some input text. Recently, Chiang et al. (2023) showed that finite-precision transformers can be equivalently expressed in a generalization of first-order logic. However, finite-precision transformers are a weak transformer variant because, as we show, a single head can only attend to a constant number of tokens and, in particular, cannot represent uniform attention. Since attending broadly is a core capability for transformers, we ask whether a minimally more expressive model that can attend universally can also be characterized in logic. To this end, we analyze transformers whose forward pass is computed in $\log n$ precision on contexts of length $n$. We prove that any log-precision transformer can be equivalently expressed as a first-order logic sentence that, in addition to standard universal and existential quantifiers, may also contain majority-vote quantifiers. This is the tightest known upper bound and first logical characterization of log-precision transformers.  ( 2 min )
    Random Field Augmentations for Self-Supervised Representation Learning. (arXiv:2311.03629v1 [cs.CV])
    Self-supervised representation learning is heavily dependent on data augmentations to specify the invariances encoded in representations. Previous work has shown that applying diverse data augmentations is crucial to downstream performance, but augmentation techniques remain under-explored. In this work, we propose a new family of local transformations based on Gaussian random fields to generate image augmentations for self-supervised representation learning. These transformations generalize the well-established affine and color transformations (translation, rotation, color jitter, etc.) and greatly increase the space of augmentations by allowing transformation parameter values to vary from pixel to pixel. The parameters are treated as continuous functions of spatial coordinates, and modeled as independent Gaussian random fields. Empirical results show the effectiveness of the new transformations for self-supervised representation learning. Specifically, we achieve a 1.7% top-1 accuracy improvement over baseline on ImageNet downstream classification, and a 3.6% improvement on out-of-distribution iNaturalist downstream classification. However, due to the flexibility of the new transformations, learned representations are sensitive to hyperparameters. While mild transformations improve representations, we observe that strong transformations can degrade the structure of an image, indicating that balancing the diversity and strength of augmentations is important for improving generalization of learned representations.
    The Linear Representation Hypothesis and the Geometry of Large Language Models. (arXiv:2311.03658v1 [cs.CL])
    Informally, the 'linear representation hypothesis' is the idea that high-level concepts are represented linearly as directions in some representation space. In this paper, we address two closely related questions: What does "linear representation" actually mean? And, how do we make sense of geometric notions (e.g., cosine similarity or projection) in the representation space? To answer these, we use the language of counterfactuals to give two formalizations of "linear representation", one in the output (word) representation space, and one in the input (sentence) space. We then prove these connect to linear probing and model steering, respectively. To make sense of geometric notions, we use the formalization to identify a particular (non-Euclidean) inner product that respects language structure in a sense we make precise. Using this causal inner product, we show how to unify all notions of linear representation. In particular, this allows the construction of probes and steering vectors using counterfactual pairs. Experiments with LLaMA-2 demonstrate the existence of linear representations of concepts, the connection to interpretation and control, and the fundamental role of the choice of inner product.
    Procedural Image Programs for Representation Learning. (arXiv:2211.16412v2 [cs.CV] UPDATED)
    Learning image representations using synthetic data allows training neural networks without some of the concerns associated with real images, such as privacy and bias. Existing work focuses on a handful of curated generative processes which require expert knowledge to design, making it hard to scale up. To overcome this, we propose training with a large dataset of twenty-one thousand programs, each one generating a diverse set of synthetic images. These programs are short code snippets, which are easy to modify and fast to execute using OpenGL. The proposed dataset can be used for both supervised and unsupervised representation learning, and reduces the gap between pre-training with real and procedurally generated images by 38%.
    Topologically Regularized Data Embeddings. (arXiv:2301.03338v2 [cs.LG] UPDATED)
    Unsupervised representation learning methods are widely used for gaining insight into high-dimensional, unstructured, or structured data. In some cases, users may have prior topological knowledge about the data, such as a known cluster structure or the fact that the data is known to lie along a tree- or graph-structured topology. However, generic methods to ensure such structure is salient in the low-dimensional representations are lacking. This negatively impacts the interpretability of low-dimensional embeddings, and plausibly downstream learning tasks. To address this issue, we introduce topological regularization: a generic approach based on algebraic topology to incorporate topological prior knowledge into low-dimensional embeddings. We introduce a class of topological loss functions, and show that jointly optimizing an embedding loss with such a topological loss function as a regularizer yields embeddings that reflect not only local proximities but also the desired topological structure. We include a self-contained overview of the required foundational concepts in algebraic topology, and provide intuitive guidance on how to design topological loss functions for a variety of shapes, such as clusters, cycles, and bifurcations. We empirically evaluate the proposed approach on computational efficiency, robustness, and versatility in combination with linear and non-linear dimensionality reduction and graph embedding methods.  ( 2 min )
    Exploring the Optimal Choice for Generative Processes in Diffusion Models: Ordinary vs Stochastic Differential Equations. (arXiv:2306.02063v2 [cs.LG] UPDATED)
    The diffusion model has shown remarkable success in computer vision, but it remains unclear whether the ODE-based probability flow or the SDE-based diffusion model is superior and under what circumstances. Comparing the two is challenging due to dependencies on data distributions, score training, and other numerical issues. In this paper, we study the problem mathematically for two limiting scenarios: the zero diffusion (ODE) case and the large diffusion case. We first introduce a pulse-shape error to perturb the score function and analyze error accumulation of sampling quality, followed by a thorough analysis for generalization to arbitrary error. Our findings indicate that when the perturbation occurs at the end of the generative process, the ODE model outperforms the SDE model with a large diffusion coefficient. However, when the perturbation occurs earlier, the SDE model outperforms the ODE model, and we demonstrate that the error of sample generation due to such a pulse-shape perturbation is exponentially suppressed as the diffusion term's magnitude increases to infinity. Numerical validation of this phenomenon is provided using Gaussian, Gaussian mixture, and Swiss roll distribution, as well as realistic datasets like MNIST and CIFAR-10.  ( 2 min )
    Offline Policy Evaluation and Optimization under Confounding. (arXiv:2211.16583v4 [stat.ML] UPDATED)
    Evaluating and optimizing policies in the presence of unobserved confounders is a problem of growing interest in offline reinforcement learning. Using conventional methods for offline RL in the presence of confounding can not only lead to poor decisions and poor policies, but also have disastrous effects in critical applications such as healthcare and education. We map out the landscape of offline policy evaluation for confounded MDPs, distinguishing assumptions on confounding based on whether they are memoryless and on their effect on the data-collection policies. We characterize settings where consistent value estimates are provably not achievable, and provide algorithms with guarantees to instead estimate lower bounds on the value. When consistent estimates are achievable, we provide algorithms for value estimation with sample complexity guarantees. We also present new algorithms for offline policy improvement and prove local convergence guarantees. Finally, we experimentally evaluate our algorithms on both a gridworld environment and a simulated healthcare setting of managing sepsis patients. In gridworld, our model-based method provides tighter lower bounds than existing methods, while in the sepsis simulator, our methods significantly outperform confounder-oblivious benchmarks.  ( 2 min )
    Attention-Enhanced Deep Learning for Device-Free Through-the-Wall Presence Detection Using Indoor WiFi System. (arXiv:2304.13105v2 [cs.LG] UPDATED)
    Accurate detection of human presence in indoor environments is important for various applications, such as energy management and security. In this paper, we propose a novel system for human presence detection using the channel state information (CSI) of WiFi signals. Our system named attention-enhanced deep learning for presence detection (ALPD) employs an attention mechanism to automatically select informative subcarriers from the CSI data and a bidirectional long short-term memory (LSTM) network to capture temporal dependencies in CSI. Additionally, we utilize a static feature to improve the accuracy of human presence detection in static states. We evaluate the proposed ALPD system by deploying a pair of WiFi access points (APs) for collecting CSI dataset, which is further compared with several benchmarks. The results demonstrate that our ALPD system outperforms the benchmarks in terms of accuracy, especially in the presence of interference. Moreover, bidirectional transmission data is beneficial to training improving stability and accuracy, as well as reducing the costs of data collection for training. Overall, our proposed ALPD system shows promising results for human presence detection using WiFi CSI signals.  ( 2 min )
    Doubly Robust Kernel Statistics for Testing Distributional Treatment Effects. (arXiv:2212.04922v2 [stat.ML] UPDATED)
    With the widespread application of causal inference, it is increasingly important to have tools which can test for the presence of causal effects in a diverse array of circumstances. In this vein we focus on the problem of testing for distributional causal effects, where the treatment affects not just the mean, but also higher order moments of the distribution, as well as multidimensional or structured outcomes. We build upon a previously introduced framework, Counterfactual Mean Embeddings, for representing causal distributions within Reproducing Kernel Hilbert Spaces (RKHS) by proposing new, improved estimators for the distributional embeddings. These improved estimators are inspired by doubly robust estimators of the causal mean, using a similar form within the kernel space. We analyse these estimators, proving they retain the doubly robust property and have improved convergence rates compared to the original estimators. This leads to new permutation-based tests for distributional causal effects, using the estimators we propose as test statistics. We experimentally and theoretically demonstrate the validity of our tests.  ( 2 min )
    Visualizing DNA reaction trajectories with deep graph embedding approaches. (arXiv:2311.03409v1 [q-bio.BM])
    Synthetic biologists and molecular programmers design novel nucleic acid reactions, with many potential applications. Good visualization tools are needed to help domain experts make sense of the complex outputs of folding pathway simulations of such reactions. Here we present ViDa, a new approach for visualizing DNA reaction folding trajectories over the energy landscape of secondary structures. We integrate a deep graph embedding model with common dimensionality reduction approaches, to map high-dimensional data onto 2D Euclidean space. We assess ViDa on two well-studied and contrasting DNA hybridization reactions. Our preliminary results suggest that ViDa's visualization successfully separates trajectories with different folding mechanisms, thereby providing useful insight to users, and is a big improvement over the current state-of-the-art in DNA kinetics visualization.
    Pipeline Parallelism for DNN Inference with Practical Performance Guarantees. (arXiv:2311.03703v1 [cs.LG])
    We optimize pipeline parallelism for deep neural network (DNN) inference by partitioning model graphs into $k$ stages and minimizing the running time of the bottleneck stage, including communication. We design practical algorithms for this NP-hard problem and show that they are nearly optimal in practice by comparing against strong lower bounds obtained via novel mixed-integer programming (MIP) formulations. We apply these algorithms and lower-bound methods to production models to achieve substantially improved approximation guarantees compared to standard combinatorial lower bounds. For example, evaluated via geometric means across production data with $k=16$ pipeline stages, our MIP formulations more than double the lower bounds, improving the approximation ratio from $2.175$ to $1.058$. This work shows that while max-throughput partitioning is theoretically hard, we have a handle on the algorithmic side of the problem in practice and much of the remaining challenge is in developing more accurate cost models to feed into the partitioning algorithms.
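    To make the objective concrete, the sketch below solves a much simpler special case than the one in the abstract: a linear chain of operators with known costs, split into exactly k contiguous stages so that the largest stage cost (the bottleneck) is minimal. The general problem on model graphs with communication is NP-hard according to the abstract; this chain version ignores communication and is solvable by dynamic programming. The function name `partition_chain` and the cost values are assumptions for this example.

```python
def partition_chain(costs, k):
    """Split `costs` into exactly k non-empty contiguous stages,
    minimizing the largest stage sum. Requires 1 <= k <= len(costs).

    Returns (bottleneck, stages).
    """
    n = len(costs)
    prefix = [0]
    for c in costs:
        prefix.append(prefix[-1] + c)

    inf = float("inf")
    # best[j][i]: minimal bottleneck for the first i operators in j stages.
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    cut = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0
    for j in range(1, k + 1):
        for i in range(1, n + 1):
            for s in range(j - 1, i):
                cand = max(best[j - 1][s], prefix[i] - prefix[s])
                if cand < best[j][i]:
                    best[j][i] = cand
                    cut[j][i] = s

    # Walk the recorded cut points backwards to recover the stages.
    stages = []
    i = n
    for j in range(k, 0, -1):
        s = cut[j][i]
        stages.append(costs[s:i])
        i = s
    stages.reverse()
    return best[k][n], stages


bottleneck, stages = partition_chain([4, 2, 3, 7, 1, 5], 3)
print(bottleneck, stages)
```

    A simple combinatorial lower bound for this special case is max(max(costs), sum(costs) / k); dividing the achieved bottleneck by a lower bound gives the kind of approximation ratio the abstract reports.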
    Generative Diffusion Models for Lattice Field Theory. (arXiv:2311.03578v1 [hep-lat])
    This study delves into the connection between machine learning and lattice field theory by linking generative diffusion models (DMs) with stochastic quantization, from a stochastic differential equation perspective. We show that DMs can be conceptualized by reversing a stochastic process driven by the Langevin equation, which then produces samples from an initial distribution to approximate the target distribution. In a toy model, we highlight the capability of DMs to learn effective actions. Furthermore, we demonstrate its feasibility to act as a global sampler for generating configurations in the two-dimensional $\phi^4$ quantum lattice field theory.
    Training Multi-layer Neural Networks on Ising Machine. (arXiv:2311.03408v1 [cs.LG])
    As dedicated quantum devices, Ising machines could solve large-scale binary optimization problems in milliseconds. There is emerging interest in utilizing Ising machines to train feedforward neural networks due to the prosperity of generative artificial intelligence. However, existing methods can only train single-layer feedforward networks because of the complex nonlinear network topology. This paper proposes an Ising learning algorithm to train quantized neural networks (QNNs) by incorporating two essential techniques, namely binary representation of the topological network and order reduction of the loss function. As far as we know, this is the first algorithm to train multi-layer feedforward networks on Ising machines, providing an alternative to gradient-based backpropagation. Firstly, training a QNN is formulated as a quadratic constrained binary optimization (QCBO) problem by representing neuron connections and activation functions as equality constraints. All quantized variables are encoded by binary bits based on a binary encoding protocol. Secondly, the QCBO is converted to a quadratic unconstrained binary optimization (QUBO) problem that can be efficiently solved on Ising machines. The conversion leverages both a penalty function and Rosenberg order reduction, which together eliminate the equality constraints and reduce the high-order loss function to a quadratic one. With some assumptions, theoretical analysis shows the space complexity of our algorithm is $\mathcal{O}(H^2L + HLN\log H)$, quantifying the required number of Ising spins. Finally, the effectiveness of the algorithm is validated with a simulated Ising machine on the MNIST dataset. After annealing for 700 ms, the classification accuracy reaches 98.3%. Among 100 runs, the success probability of finding the optimal solution is 72%. As the number of spins on Ising machines increases, our algorithm has the potential to train deeper neural networks.
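    The abstract names Rosenberg order reduction without spelling it out. As a hedged sketch, the textbook form replaces a product x1*x2 of binary variables with an auxiliary binary variable y and adds the penalty x1*x2 - 2*x1*y - 2*x2*y + 3*y, which is zero exactly when y equals x1*x2 and at least one otherwise. The example below brute-forces a single cubic term; the penalty weight of 2 and the function names are assumptions for this illustration, not values from the paper.

```python
from itertools import product


def rosenberg_penalty(x1, x2, y):
    """Quadratic penalty that vanishes iff y == x1 * x2 (binary inputs)."""
    return x1 * x2 - 2 * x1 * y - 2 * x2 * y + 3 * y


def cubic(x1, x2, x3):
    """Original third-order term."""
    return x1 * x2 * x3


def reduced(x1, x2, x3, y, weight=2):
    """Quadratic surrogate: y stands in for x1 * x2.

    `weight` is an assumed penalty strength; any value >= 1 works for
    this single unit-coefficient term.
    """
    return y * x3 + weight * rosenberg_penalty(x1, x2, y)


def surrogate_matches():
    """Check that minimizing over y recovers the cubic term everywhere."""
    return all(
        min(reduced(x1, x2, x3, y) for y in (0, 1)) == cubic(x1, x2, x3)
        for x1, x2, x3 in product((0, 1), repeat=3)
    )


print(surrogate_matches())
```

    Applying the substitution repeatedly lowers any higher-order polynomial to degree two, at the cost of one extra binary variable per replaced product.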
    Causal Structure Representation Learning of Confounders in Latent Space for Recommendation. (arXiv:2311.03382v1 [cs.IR])
    Inferring user preferences from the historical feedback of users is a valuable problem in recommender systems. Conventional approaches often rely on the assumption that user preferences in the feedback data are equivalent to the real user preferences without additional noise, which simplifies the problem modeling. However, there are various confounders during user-item interactions, such as weather and even the recommendation system itself. Therefore, neglecting the influence of confounders will result in inaccurate user preferences and suboptimal performance of the model. Furthermore, the unobservability of confounders poses a challenge in further addressing the problem. To address these issues, we refine the problem and propose a more rational solution. Specifically, we consider the influence of confounders, disentangle them from user preferences in the latent space, and employ causal graphs to model their interdependencies without specific labels. By cleverly combining local and global causal graphs, we capture the user-specificity of confounders on user preferences. We theoretically demonstrate the identifiability of the obtained causal graph. Finally, we propose our model based on Variational Autoencoders, named Causal Structure representation learning of Confounders in latent space (CSC). We conducted extensive experiments on one synthetic dataset and five real-world datasets, demonstrating the superiority of our model. Furthermore, we demonstrate that the learned causal representations of confounders are controllable, potentially offering users fine-grained control over the objectives of their recommendation lists with the learned causal graphs.
    Can We Trust the Similarity Measurement in Federated Learning?. (arXiv:2311.03369v1 [cs.LG])
    Is it secure to measure the reliability of local models by similarity in federated learning (FL)? This paper delves into an unexplored security threat concerning applying similarity metrics, such as the L_2 norm, Euclidean distance, and cosine similarity, in protecting FL. We first uncover the deficiencies of similarity metrics that high-dimensional local models, including benign and poisoned models, may be evaluated to have the same similarity while being significantly different in the parameter values. We then leverage this finding to devise a novel untargeted model poisoning attack, Faker, which launches the attack by simultaneously maximizing the evaluated similarity of the poisoned local model and the difference in the parameter values. Experimental results based on seven datasets and eight defenses show that Faker outperforms the state-of-the-art benchmark attacks by 1.1-9.0X in reducing accuracy and 1.2-8.0X in saving time cost, which even holds for the case of a single malicious client with limited knowledge about the FL system. Moreover, Faker can degrade the performance of the global model by attacking only once. We also preliminarily explore extending Faker to other attacks, such as backdoor attacks and Sybil attacks. Lastly, we provide a model evaluation strategy, called the similarity of partial parameters (SPP), to defend against Faker. Given that numerous mechanisms in FL utilize similarity metrics to assess local models, this work suggests that we should be vigilant regarding the potential risks of using these metrics.
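    The deficiency described above can be illustrated in a few lines. The sketch below is not the Faker attack; it only shows that two different parameter vectors can have the same cosine similarity and the same Euclidean distance to a reference vector. The construction (reflecting a vector across the reference direction) and all names and values are assumptions chosen for this example.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    return dot(a, b) / math.sqrt(dot(a, a) * dot(b, b))


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def reflect_across(v, axis):
    """Mirror v across the line spanned by `axis`.

    The reflection keeps the norm of v and its inner product with
    `axis`, so cosine similarity and Euclidean distance to `axis`
    are unchanged.
    """
    scale = 2 * dot(v, axis) / dot(axis, axis)
    return [scale * a - x for a, x in zip(axis, v)]


# Hypothetical reference update and a benign local update.
reference = [0.5, -1.0, 2.0, 0.1]
benign = [0.7, -0.6, 1.5, 0.9]
crafted = reflect_across(benign, reference)

print(cosine(benign, reference), cosine(crafted, reference))
print(euclidean(benign, reference), euclidean(crafted, reference))
print(max(abs(b - c) for b, c in zip(benign, crafted)))
```

    A defense that scores local models only by these metrics cannot tell `benign` and `crafted` apart, even though their coordinates differ.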
    CMIP X-MOS: Improving Climate Models with Extreme Model Output Statistics. (arXiv:2311.03370v1 [physics.ao-ph])
    Climate models are essential for assessing the impact of greenhouse gas emissions on our changing climate and the resulting increase in the frequency and severity of natural disasters. Despite the widespread acceptance of climate models produced by the Coupled Model Intercomparison Project (CMIP), they still face challenges in accurately predicting climate extremes, which pose the most significant threats to both people and the environment. To address this limitation and improve predictions of natural disaster risks, we introduce Extreme Model Output Statistics (X-MOS). This approach utilizes deep regression techniques to precisely map CMIP model outputs to real measurements obtained from weather stations, which results in a more accurate analysis of XXI-century climate extremes. In contrast to previous research, our study places a strong emphasis on enhancing the estimation of the tails of future climate parameter distributions. The latter supports decision-makers, enabling them to better assess climate-related risks across the globe.
    Transferability and explainability of deep learning emulators for regional climate model projections: Perspectives for future applications. (arXiv:2311.03378v1 [physics.ao-ph])
    Regional climate models (RCMs) are essential tools for simulating and studying regional climate variability and change. However, their high computational cost limits the production of comprehensive ensembles of regional climate projections covering multiple scenarios and driving Global Climate Models (GCMs) across regions. RCM emulators based on deep learning models have recently been introduced as a cost-effective and promising alternative that requires only short RCM simulations to train the models. Therefore, evaluating their transferability to different periods, scenarios, and GCMs becomes a pivotal and complex task in which the inherent biases of both GCMs and RCMs play a significant role. Here we focus on this problem by considering the two different emulation approaches proposed in the literature (PP and MOS, following the terminology introduced in this paper). In addition to standard evaluation techniques, we expand the analysis with methods from the field of eXplainable Artificial Intelligence (XAI), to assess the physical consistency of the empirical links learnt by the models. We find that both approaches are able to emulate certain climatological properties of RCMs for different periods and scenarios (soft transferability), but the consistency of the emulation functions differs between approaches. Whereas PP learns robust and physically meaningful patterns, MOS results are GCM-dependent and lack physical consistency in some cases. Both approaches face problems when transferring the emulation function to other GCMs, due to the existence of GCM-dependent biases (hard transferability). This limits their applicability to build ensembles of regional climate projections. We conclude by giving some prospects for future applications.
    Kernel-based Joint Multiple Graph Learning and Clustering of Graph Signals. (arXiv:2310.19005v2 [eess.SP] UPDATED)
    Within the context of Graph Signal Processing (GSP), Graph Learning (GL) is concerned with the inference of the graph's underlying structure from nodal observations. However, real-world data often contains diverse information, necessitating the simultaneous clustering and learning of multiple graphs. In practical applications, valuable node-specific covariates, represented as kernels, have been underutilized by existing graph signal clustering methods. In this letter, we propose a new framework, named Kernel-based joint Multiple GL and clustering of graph signals (KMGL), that leverages a multi-convex optimization approach. This allows us to integrate node-side information, construct low-pass filters, and efficiently solve the optimization problem. The experiments demonstrate that KMGL significantly enhances the robustness of GL and clustering, particularly in scenarios with high noise levels and a substantial number of clusters. These findings underscore the potential of KMGL for improving the performance of GSP methods in diverse, real-world applications.
    Birth of a Transformer: A Memory Viewpoint. (arXiv:2306.00802v2 [stat.ML] UPDATED)
    Large language models based on transformers have achieved great empirical successes. However, as they are deployed more widely, there is a growing need to better understand their internal mechanisms in order to make them more reliable. These models appear to store vast amounts of knowledge from their training data, and to adapt quickly to new information provided in their context or prompt. We study how transformers balance these two types of knowledge by considering a synthetic setup where tokens are generated from either global or context-specific bigram distributions. By a careful empirical analysis of the training process on a simplified two-layer transformer, we illustrate the fast learning of global bigrams and the slower development of an "induction head" mechanism for the in-context bigrams. We highlight the role of weight matrices as associative memories, provide theoretical insights on how gradients enable their learning during training, and study the role of data-distributional properties.
    Multilingual Mathematical Autoformalization. (arXiv:2311.03755v1 [cs.CL])
    Autoformalization is the task of translating natural language materials into machine-verifiable formalisations. Progress in autoformalization research is hindered by the lack of a sizeable dataset consisting of informal-formal pairs expressing the same essence. Existing methods tend to circumvent this challenge by manually curating small corpora or using few-shot learning with large language models. But these methods suffer from data scarcity and formal language acquisition difficulty. In this work, we create $\texttt{MMA}$, a large, flexible, multilingual, and multi-domain dataset of informal-formal pairs, by using a language model to translate in the reverse direction, that is, from formal mathematical statements into corresponding informal ones. Experiments show that language models fine-tuned on $\texttt{MMA}$ produce $16-18\%$ of statements acceptable with minimal corrections on the $\texttt{miniF2F}$ and $\texttt{ProofNet}$ benchmarks, up from $0\%$ with the base model. We demonstrate that fine-tuning on multilingual formal data results in more capable autoformalization models even when deployed on monolingual tasks.
    Convergence Analysis of Mean Shift. (arXiv:2305.08463v3 [stat.ML] UPDATED)
    The mean shift (MS) algorithm seeks a mode of the kernel density estimate (KDE). This study presents a convergence guarantee of the mode estimate sequence generated by the MS algorithm and an evaluation of the convergence rate, under fairly mild conditions, with the help of the argument concerning the Łojasiewicz inequality. Our findings extend existing ones covering analytic kernels and the Epanechnikov kernel. Those are significant in that they cover the biweight kernel, which is optimal among non-negative kernels in terms of the asymptotic statistical efficiency for the KDE-based mode estimation.
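    As a rough illustration of the algorithm the abstract above analyzes, here is a minimal one-dimensional mean shift sketch with the biweight kernel. It assumes the standard formulation in which each update is a weighted mean of the data, with weights proportional to the negative derivative of the kernel profile; for the biweight profile k(u) = (1 - u)^2 on [0, 1] that weight is proportional to 1 - u. The data, bandwidth, starting points, and function names are hypothetical, and the sketch does not reproduce the paper's convergence analysis.

```python
def biweight_shift_weight(u):
    """Weight g(u) = 1 - u on [0, 1), proportional to -k'(u) for the
    biweight profile k(u) = (1 - u)^2; zero outside the kernel support."""
    return 1.0 - u if u < 1.0 else 0.0

def mean_shift_mode(data, start, bandwidth, tol=1e-10, max_iter=1000):
    """Iterate the mean shift update from `start` until the step is below `tol`."""
    x = start
    for _ in range(max_iter):
        weights = [biweight_shift_weight(((x - xi) / bandwidth) ** 2) for xi in data]
        total = sum(weights)
        if total == 0.0:
            # No sample lies inside the kernel window, so the update is undefined.
            break
        new_x = sum(w * xi for w, xi in zip(weights, data)) / total
        if abs(new_x - x) < tol:
            return new_x
        x = new_x
    return x

# Hypothetical samples forming two well-separated clusters.
samples = [0.8, 0.9, 1.0, 1.1, 1.2, 4.8, 5.0, 5.2]
left_mode = mean_shift_mode(samples, start=1.5, bandwidth=1.0)
right_mode = mean_shift_mode(samples, start=5.3, bandwidth=1.0)
print(left_mode, right_mode)
```

    Each starting point climbs to the mode of the KDE nearest to it, so the two runs settle on the centers of the two clusters.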
    Determination of droplet size from wide-angle light scattering image data using convolutional neural networks. (arXiv:2311.03387v1 [cs.CV])
    Wide-angle light scattering (WALS) offers the possibility of a highly temporally and spatially resolved measurement of droplets in spray-based methods for nanoparticle synthesis. The size of these droplets is a critical variable affecting the final properties of synthesized materials such as hetero-aggregates. However, conventional methods for determining droplet sizes from WALS image data are labor-intensive and may introduce biases, particularly when applied to complex systems like spray flame synthesis (SFS). To address these challenges, we introduce a fully automatic machine learning-based approach that employs convolutional neural networks (CNNs) in order to streamline the droplet sizing process. This CNN-based methodology offers further advantages: it requires few manual labels and can utilize transfer learning, making it a promising alternative to conventional methods, specifically with respect to efficiency. To evaluate the performance of our machine learning models, we consider WALS data from an ethanol spray flame process at various heights above the burner surface (HABs), where the models are trained and cross-validated on a large dataset comprising nearly 35000 WALS images.
    In-Context Exemplars as Clues to Retrieving from Large Associative Memory. (arXiv:2311.03498v1 [cs.CL])
    Recently, large language models (LLMs) have made remarkable progress in natural language processing. The most representative ability of LLMs is in-context learning (ICL), which enables LLMs to learn patterns from in-context exemplars without training. The performance of ICL greatly depends on the exemplars used. However, how to choose exemplars remains unclear due to the lack of understanding of how in-context learning works. In this paper, we present a novel perspective on ICL by conceptualizing it as contextual retrieval from a model of associative memory. We establish a theoretical framework of ICL based on Hopfield Networks. Based on our framework, we look into how in-context exemplars influence the performance of ICL and propose more efficient active exemplar selection. Our study sheds new light on the mechanism of ICL by connecting it to memory retrieval, with potential implications for advancing the understanding of LLMs.
    When Noisy Labels Meet Long Tail Dilemmas: A Representation Calibration Method. (arXiv:2211.10955v3 [cs.LG] UPDATED)
    Real-world large-scale datasets are both noisily labeled and class-imbalanced. These issues seriously hurt the generalization of trained models. It is hence important to address simultaneous incorrect labeling and class imbalance, i.e., the problem of learning with noisy labels on long-tailed data. Previous works have developed several methods for the problem. However, they always rely on strong assumptions that are invalid or hard to check in practice. In this paper, to handle the problem and address the limitations of prior works, we propose a representation calibration method, RCAL. Specifically, RCAL works with the representations extracted by unsupervised contrastive learning. We assume that, without incorrect labeling and class imbalance, the representations of instances in each class conform to a multivariate Gaussian distribution, which is a much milder assumption and easier to check. Based on the assumption, we recover underlying representation distributions from polluted ones resulting from mislabeled and class-imbalanced data. Additional data points are then sampled from the recovered distributions to help generalization. Moreover, during classifier training, representation learning takes advantage of the representation robustness brought by contrastive learning, which further improves the classifier performance. We derive theoretical results to discuss the effectiveness of our representation calibration. Experiments on multiple benchmarks justify our claims and confirm the superiority of the proposed method.  ( 3 min )
    Stationary Kernels and Gaussian Processes on Lie Groups and their Homogeneous Spaces I: the compact case. (arXiv:2208.14960v3 [stat.ME] UPDATED)
    Gaussian processes are arguably the most important class of spatiotemporal models within machine learning. They encode prior information about the modeled function and can be used for exact or approximate Bayesian learning. In many applications, particularly in physical sciences and engineering, but also in areas such as geostatistics and neuroscience, invariance to symmetries is one of the most fundamental forms of prior information one can consider. The invariance of a Gaussian process' covariance to such symmetries gives rise to the most natural generalization of the concept of stationarity to such spaces. In this work, we develop constructive and practical techniques for building stationary Gaussian processes on a very large class of non-Euclidean spaces arising in the context of symmetries. Our techniques make it possible to (i) calculate covariance kernels and (ii) sample from prior and posterior Gaussian processes defined on such spaces, both in a practical manner. This work is split into two parts, each involving different technical considerations: part I studies compact spaces, while part II studies non-compact spaces possessing certain structure. Our contributions make the non-Euclidean Gaussian process models we study compatible with well-understood computational techniques available in standard Gaussian process software packages, thereby making them accessible to practitioners.
    Rethinking Symbolic Regression Datasets and Benchmarks for Scientific Discovery. (arXiv:2206.10540v4 [cs.LG] UPDATED)
    This paper revisits datasets and evaluation criteria for Symbolic Regression (SR), specifically focused on its potential for scientific discovery. Focused on a set of formulas used in the existing datasets based on Feynman Lectures on Physics, we recreate 120 datasets to discuss the performance of symbolic regression for scientific discovery (SRSD). For each of the 120 SRSD datasets, we carefully review the properties of the formula and its variables to design reasonably realistic sampling ranges of values so that our new SRSD datasets can be used for evaluating the potential of SRSD such as whether or not an SR method can (re)discover physical laws from such datasets. We also create another 120 datasets that contain dummy variables to examine whether SR methods can choose necessary variables only. Besides, we propose to use normalized edit distances (NED) between a predicted equation and the true equation trees for addressing a critical issue that existing SR metrics are either binary or errors between the target values and an SR model's predicted values for a given input. We conduct benchmark experiments on our new SRSD datasets using various representative SR methods. The experimental results show that we provide a more realistic performance evaluation, and our user study shows that the NED correlates with human judges significantly more than an existing SR metric.  ( 3 min )
    Ensembling Textual and Structure-Based Models for Knowledge Graph Completion. (arXiv:2311.03780v1 [cs.CL])
    We consider two popular approaches to Knowledge Graph Completion (KGC): textual models that rely on textual entity descriptions, and structure-based models that exploit the connectivity structure of the Knowledge Graph (KG). Preliminary experiments show that these approaches have complementary strengths: structure-based models perform well when the gold answer is easily reachable from the query head in the KG, while textual models exploit descriptions to give good performance even when the gold answer is not reachable. In response, we explore ensembling as a way of combining the best of both approaches. We propose a novel method for learning query-dependent ensemble weights by using the distributions of scores assigned by individual models to all candidate entities. Our ensemble baseline achieves state-of-the-art results on three standard KGC datasets, with up to 6.8 pt MRR and 8.3 pt Hits@1 gains over best individual models.  ( 2 min )
    Using Sum-Product Networks to Assess Uncertainty in Deep Active Learning. (arXiv:2206.09798v2 [cs.LG] UPDATED)
    The success of deep active learning hinges on the choice of an effective acquisition function, which ranks not yet labeled data points according to their expected informativeness. Many acquisition functions are (partly) based on the uncertainty that the current model has about the class label of a point, yet there is no generally agreed upon strategy for computing such uncertainty. This paper proposes a new and very simple approach to computing uncertainty in deep active learning with a Convolutional Neural Network (CNN). The main idea is to use the feature representation extracted by the CNN as data for training a Sum-Product Network (SPN). Since SPNs are typically used for estimating the distribution of a dataset, they are well suited to the task of estimating class probabilities that can be used directly by standard acquisition functions such as max entropy and variational ratio. The effectiveness of our method is demonstrated in an experimental study on several standard benchmark datasets for image classification, where we compare it to various state-of-the-art methods for assessing uncertainty in deep active learning.  ( 2 min )
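    The two acquisition functions named in the abstract above can be sketched in a few lines. The example assumes class probabilities are already available for each unlabeled point (in the paper they would come from the SPN; here they are hypothetical hand-written values), and it uses the usual definitions: predictive entropy, and the variation ratio 1 - max_c p(c), which the abstract calls "variational ratio".

```python
import math

def entropy(probs):
    """Predictive entropy of a class-probability vector (natural log)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def variation_ratio(probs):
    """One minus the probability of the most likely class."""
    return 1.0 - max(probs)

def select(pool, score):
    """Return the name of the unlabeled point with the highest acquisition score."""
    return max(pool, key=lambda name: score(pool[name]))

# Hypothetical class probabilities for three unlabeled points.
pool = {
    "x1": [0.90, 0.05, 0.05],  # confident prediction
    "x2": [0.40, 0.35, 0.25],  # fairly uncertain
    "x3": [0.34, 0.33, 0.33],  # close to uniform
}

pick_by_entropy = select(pool, entropy)
pick_by_ratio = select(pool, variation_ratio)
print(pick_by_entropy, pick_by_ratio)
```

    Both criteria prefer the near-uniform point here, though in general they can rank a pool differently because entropy looks at the whole distribution while the variation ratio looks only at the top class.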
    Probabilistic Categorical Adversarial Attack & Adversarial Training. (arXiv:2210.09364v3 [cs.LG] UPDATED)
    The existence of adversarial examples raises serious concerns about applying Deep Neural Networks (DNNs) in safety-critical tasks. However, how to generate adversarial examples for categorical data is an important problem that lacks extensive exploration. Previously established methods rely on greedy search, which can be very time-consuming when conducting a successful attack. This also limits the development of adversarial training and potential defenses for categorical data. To tackle this problem, we propose Probabilistic Categorical Adversarial Attack (PCAA), which transfers the discrete optimization problem to a continuous problem that can be solved efficiently by Projected Gradient Descent. In our paper, we theoretically analyze its optimality and time complexity to demonstrate its significant advantage over current greedy-based attacks. Moreover, based on our attack, we propose an efficient adversarial training framework. Through a comprehensive empirical study, we justify the effectiveness of our proposed attack and defense algorithms.  ( 2 min )
    Accurate 3D Object Detection using Energy-Based Models. (arXiv:2012.04634v2 [cs.CV] UPDATED)
    Accurate 3D object detection (3DOD) is crucial for safe navigation of complex environments by autonomous robots. Regressing accurate 3D bounding boxes in cluttered environments based on sparse LiDAR data is however a highly challenging problem. We address this task by exploring recent advances in conditional energy-based models (EBMs) for probabilistic regression. While methods employing EBMs for regression have demonstrated impressive performance on 2D object detection in images, these techniques are not directly applicable to 3D bounding boxes. In this work, we therefore design a differentiable pooling operator for 3D bounding boxes, serving as the core module of our EBM network. We further integrate this general approach into the state-of-the-art 3D object detector SA-SSD. On the KITTI dataset, our proposed approach consistently outperforms the SA-SSD baseline across all 3DOD metrics, demonstrating the potential of EBM-based regression for highly accurate 3DOD. Code is available at https://github.com/fregu856/ebms_3dod.  ( 2 min )
    Neuro-GPT: Developing A Foundation Model for EEG. (arXiv:2311.03764v1 [cs.LG])
    To handle the scarcity and heterogeneity of electroencephalography (EEG) data in Brain-Computer Interface (BCI) tasks, and to harness the vast public data, we propose Neuro-GPT, a foundation model consisting of an EEG encoder and a GPT model. The foundation model is pre-trained on a large-scale public EEG dataset, using a self-supervised task which learns how to reconstruct the masked chunk in EEG. We then fine-tune the foundation model on a Motor Imagery Classification task where only 9 subjects are available. Experiments demonstrate that applying a foundation model can significantly improve classification performance compared to a model trained from scratch, which provides evidence for the advanced generalizability of the foundation model and its ability to address the challenges of data scarcity and heterogeneity.  ( 2 min )
    Asynchronous Local Computations in Distributed Bayesian Learning. (arXiv:2311.03496v1 [cs.LG])
    Due to the expanding scope of machine learning (ML) to the fields of sensor networking, cooperative robotics and many other multi-agent systems, distributed deployment of inference algorithms has received a lot of attention. These algorithms involve collaboratively learning unknown parameters from dispersed data collected by multiple agents. There are two competing aspects in such algorithms, namely, intra-agent computation and inter-agent communication. Traditionally, algorithms are designed to perform both synchronously. However, certain circumstances need frugal use of communication channels as they are either unreliable, time-consuming, or resource-expensive. In this paper, we propose gossip-based asynchronous communication to leverage fast computations and reduce communication overhead simultaneously. We analyze the effects of multiple (local) intra-agent computations by the active agents between successive inter-agent communications. For local computations, Bayesian sampling via unadjusted Langevin algorithm (ULA) MCMC is utilized. The communication is assumed to be over a connected graph (e.g., as in decentralized learning), however, the results can be extended to coordinated communication where there is a central server (e.g., federated learning). We theoretically quantify the convergence rates in the process. To demonstrate the efficacy of the proposed algorithm, we present simulations on a toy problem as well as on real world data sets to train ML models to perform classification tasks. We observe faster initial convergence and improved performance accuracy, especially in the low data range. We achieve on average 78% and over 90% classification accuracy respectively on the Gamma Telescope and mHealth data sets from the UCI ML repository.  ( 3 min )
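    The local computation step mentioned in the abstract above, the unadjusted Langevin algorithm (ULA), has a simple generic form: x <- x + h * grad log pi(x) + sqrt(2h) * xi with xi ~ N(0, 1). The sketch below is a single-agent illustration only; it omits the gossip-based communication entirely, and the Gaussian target, step size, and chain length are hypothetical choices.

```python
import math
import random

def ula_samples(grad_log_density, x0, step, n_steps, rng):
    """Run ULA: a gradient step on the log-density plus Gaussian noise, no accept/reject."""
    x = x0
    chain = []
    for _ in range(n_steps):
        noise = rng.gauss(0.0, 1.0)
        x = x + step * grad_log_density(x) + math.sqrt(2.0 * step) * noise
        chain.append(x)
    return chain

# Hypothetical target: a one-dimensional Gaussian posterior N(MU, SIGMA^2).
MU, SIGMA = 2.0, 1.0

def grad_log_gaussian(x):
    """Gradient of log N(x; MU, SIGMA^2) with respect to x."""
    return -(x - MU) / SIGMA ** 2

rng = random.Random(0)
chain = ula_samples(grad_log_gaussian, x0=0.0, step=0.1, n_steps=25000, rng=rng)
kept = chain[5000:]  # discard burn-in
sample_mean = sum(kept) / len(kept)
print("estimated posterior mean:", sample_mean)
```

    Because ULA skips the Metropolis correction, its stationary distribution is slightly biased for a fixed step size; for this Gaussian target the mean is preserved while the variance is inflated a little.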
    Context Shift Reduction for Offline Meta-Reinforcement Learning. (arXiv:2311.03695v1 [cs.LG])
    Offline meta-reinforcement learning (OMRL) utilizes pre-collected offline datasets to enhance the agent's generalization ability on unseen tasks. However, the context shift problem arises due to the distribution discrepancy between the contexts used for training (from the behavior policy) and testing (from the exploration policy). The context shift problem leads to incorrect task inference and further deteriorates the generalization ability of the meta-policy. Existing OMRL methods either overlook this problem or attempt to mitigate it with additional information. In this paper, we propose a novel approach called Context Shift Reduction for OMRL (CSRO) to address the context shift problem with only offline datasets. The key insight of CSRO is to minimize the influence of policy in context during both the meta-training and meta-test phases. During meta-training, we design a max-min mutual information representation learning mechanism to diminish the impact of the behavior policy on task representation. In the meta-test phase, we introduce the non-prior context collection strategy to reduce the effect of the exploration policy. Experimental results demonstrate that CSRO significantly reduces the context shift and improves the generalization ability, surpassing previous methods across various challenging domains.  ( 2 min )
    Sparse Interaction Additive Networks via Feature Interaction Detection and Sparse Selection. (arXiv:2209.09326v2 [cs.LG] UPDATED)
    There is currently a large gap in performance between the statistically rigorous methods like linear regression or additive splines and the powerful deep methods using neural networks. Previous works attempting to close this gap have failed to fully investigate the exponentially growing number of feature combinations which deep networks consider automatically during training. In this work, we develop a tractable selection algorithm to efficiently identify the necessary feature combinations by leveraging techniques in feature interaction detection. Our proposed Sparse Interaction Additive Networks (SIAN) construct a bridge from these simple and interpretable models to fully connected neural networks. SIAN achieves competitive performance against state-of-the-art methods across multiple large-scale tabular datasets and consistently finds an optimal tradeoff between the modeling capacity of neural networks and the generalizability of simpler methods.  ( 2 min )
    Improved weight initialization for deep and narrow feedforward neural network. (arXiv:2311.03733v1 [cs.LG])
    Appropriate weight initialization settings, along with the ReLU activation function, have been a cornerstone of modern deep learning, making it possible to train and deploy highly effective and efficient neural network models across diverse areas of artificial intelligence. The problem of dying ReLU, where ReLU neurons become inactive and yield zero output, presents a significant challenge in the training of deep neural networks with the ReLU activation function. Theoretical research and various methods have been introduced to address the problem. However, even with these methods and research, training remains challenging for extremely deep and narrow feedforward networks with the ReLU activation function. In this paper, we propose a new weight initialization method to address this issue. We prove the properties of the proposed initial weight matrix and demonstrate how these properties facilitate the effective propagation of signal vectors. Through a series of experiments and comparisons with existing methods, we demonstrate the effectiveness of the new initialization method.  ( 2 min )
    UP-NeRF: Unconstrained Pose-Prior-Free Neural Radiance Fields. (arXiv:2311.03784v1 [cs.CV])
    Neural Radiance Field (NeRF) has enabled novel view synthesis with high fidelity given images and camera poses. Subsequent works even succeeded in eliminating the necessity of pose priors by jointly optimizing NeRF and camera pose. However, these works are limited to relatively simple settings such as photometrically consistent and occluder-free image collections or a sequence of images from a video. So they have difficulty handling unconstrained images with varying illumination and transient occluders. In this paper, we propose UP-NeRF (Unconstrained Pose-prior-free Neural Radiance Fields) to optimize NeRF with unconstrained image collections without camera pose prior. We tackle these challenges with surrogate tasks that optimize color-insensitive feature fields and a separate module for transient occluders to block their influence on pose estimation. In addition, we introduce a candidate head to enable more robust pose estimation and transient-aware depth supervision to minimize the effect of incorrect prior. Our experiments verify the superior performance of our method compared to the baselines including BARF and its variants in a challenging internet photo collection, the Phototourism dataset. The code of UP-NeRF is available at https://github.com/mlvlab/UP-NeRF.  ( 2 min )
    ProPath: Disease-Specific Protein Language Model for Variant Pathogenicity. (arXiv:2311.03429v1 [q-bio.GN])
    Clinical variant classification of pathogenic versus benign genetic variants remains a pivotal challenge in clinical genetics. Recently, the proposition of protein language models has improved the generic variant effect prediction (VEP) accuracy via weakly-supervised or unsupervised training. However, these VEPs are not disease-specific, limiting their adaptation at point-of-care. To address this problem, we propose a disease-specific protein language model for variant pathogenicity, termed ProPath, to capture the pseudo-log-likelihood ratio in rare missense variants through a siamese network. We evaluate the performance of ProPath against pre-trained language models, using clinical variant sets in inherited cardiomyopathies and arrhythmias that were not seen during training. Our results demonstrate that ProPath surpasses the pre-trained ESM1b with an over $5\%$ improvement in AUC across both datasets. Furthermore, our model achieved the highest performances across all baselines for both datasets. Thus, our ProPath offers a potent disease-specific variant effect prediction, particularly valuable for disease associations and clinical applicability.  ( 2 min )
    Generalizability of Adversarial Robustness Under Distribution Shifts. (arXiv:2209.15042v3 [cs.LG] UPDATED)
    Recent progress in empirical and certified robustness promises to deliver reliable and deployable Deep Neural Networks (DNNs). Despite that success, most existing evaluations of DNN robustness have been done on images sampled from the same distribution on which the model was trained. However, in the real world, DNNs may be deployed in dynamic environments that exhibit significant distribution shifts. In this work, we take a first step towards thoroughly investigating the interplay between empirical and certified adversarial robustness on one hand and domain generalization on another. To do so, we train robust models on multiple domains and evaluate their accuracy and robustness on an unseen domain. We observe that: (1) both empirical and certified robustness generalize to unseen domains, and (2) the level of generalizability does not correlate well with input visual similarity, measured by the FID between source and target domains. We also extend our study to cover a real-world medical application, in which adversarial augmentation significantly boosts the generalization of robustness with minimal effect on clean data accuracy.  ( 2 min )
    An Explainable Framework for Machine learning-Based Reactive Power Optimization of Distribution Network. (arXiv:2311.03863v1 [eess.SY])
    To reduce the heavy computational burden of reactive power optimization of distribution networks, machine learning models are receiving increasing attention. However, most machine learning models (e.g., neural networks) are usually considered as black boxes, making it challenging for power system operators to identify and comprehend potential biases or errors in the decision-making process of machine learning models. To address this issue, an explainable machine-learning framework is proposed to optimize the reactive power in distribution networks. Firstly, a Shapley additive explanation framework is presented to measure the contribution of each input feature to the solution of reactive power optimizations generated from machine learning models. Secondly, a model-agnostic approximation method is developed to estimate Shapley values, so as to avoid the heavy computational burden associated with direct calculations of Shapley values. The simulation results show that the proposed explainable framework can accurately explain the solution of the machine learning model-based reactive power optimization by using visual analytics, from both global and instance perspectives. Moreover, the proposed explainable framework is model-agnostic, and thus applicable to various models (e.g., neural networks).  ( 2 min )
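    To make the cost trade-off in the abstract above concrete, the sketch below compares exact Shapley values (averaging marginal contributions over every feature ordering) with a generic Monte Carlo estimate over randomly sampled orderings. The permutation-sampling estimator is a standard stand-in and is not claimed to be the paper's approximation method; the toy `model`, the input, and the all-zero baseline are hypothetical.

```python
import itertools
import random

def model(x):
    """Hypothetical black-box model: one additive term and one interaction term."""
    return 2.0 * x[0] + x[1] * x[2]

def masked_input(x, baseline, present):
    """Keep features in `present`; replace the others with their baseline value."""
    return [x[i] if i in present else baseline[i] for i in range(len(x))]

def contributions_along(order, x, baseline):
    """Marginal contribution of each feature when features are added in `order`."""
    present = set()
    previous = model(masked_input(x, baseline, present))
    contrib = [0.0] * len(x)
    for i in order:
        present.add(i)
        current = model(masked_input(x, baseline, present))
        contrib[i] = current - previous
        previous = current
    return contrib

def exact_shapley(x, baseline):
    """Average marginal contributions over all n! orderings (feasible only for small n)."""
    n = len(x)
    orders = list(itertools.permutations(range(n)))
    totals = [0.0] * n
    for order in orders:
        for i, c in enumerate(contributions_along(order, x, baseline)):
            totals[i] += c
    return [t / len(orders) for t in totals]

def sampled_shapley(x, baseline, n_samples, rng):
    """Monte Carlo estimate: average over randomly shuffled orderings."""
    n = len(x)
    totals = [0.0] * n
    for _ in range(n_samples):
        order = list(range(n))
        rng.shuffle(order)
        for i, c in enumerate(contributions_along(order, x, baseline)):
            totals[i] += c
    return [t / n_samples for t in totals]

x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
exact = exact_shapley(x, baseline)
approx = sampled_shapley(x, baseline, n_samples=2000, rng=random.Random(0))
print("exact:", exact)
print("sampled:", approx)
```

    The additive feature receives its full effect, while the two interacting features split their joint effect evenly. The exact routine scales factorially with the number of features, which is the burden an approximation is meant to avoid.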
    Structure of universal formulas. (arXiv:2311.03910v1 [cs.LG])
    By universal formulas we understand parameterized analytic expressions that have a fixed complexity, but nevertheless can approximate any continuous function on a compact set. There exist various examples of such formulas, including some in the form of neural networks. In this paper we analyze the essential structural elements of these highly expressive models. We introduce a hierarchy of expressiveness classes connecting the global approximability property to the weaker property of infinite VC dimension, and prove a series of classification results for several increasingly complex functional families. In particular, we introduce a general family of polynomially-exponentially-algebraic functions that, as we prove, is subject to polynomial constraints. As a consequence, we show that fixed-size neural networks with not more than one layer of neurons having transcendental activations (e.g., sine or standard sigmoid) cannot in general approximate functions on arbitrary finite sets. On the other hand, we give examples of functional families, including two-hidden-layer neural networks, that approximate functions on arbitrary finite sets, but fail to do that on the whole domain of definition.  ( 2 min )
    SeRO: Self-Supervised Reinforcement Learning for Recovery from Out-of-Distribution Situations. (arXiv:2311.03651v1 [cs.LG])
    Robotic agents trained using reinforcement learning have the problem of taking unreliable actions in an out-of-distribution (OOD) state. Agents can easily become OOD in real-world environments because it is almost impossible for them to visit and learn the entire state space during training. Unfortunately, unreliable actions do not ensure that agents perform their original tasks successfully. Therefore, agents should be able to recognize whether they are in OOD states and learn how to return to the learned state distribution rather than continue to take unreliable actions. In this study, we propose a novel method for retraining agents to recover from OOD situations in a self-supervised manner when they fall into OOD states. Our in-depth experimental results demonstrate that our method substantially improves the agent's ability to recover from OOD situations in terms of sample efficiency and restoration of the performance for the original tasks. Moreover, we show that our method can retrain the agent to recover from OOD situations even when in-distribution states are difficult to visit through exploration.  ( 2 min )
    Posterior Sampling-Based Bayesian Optimization with Tighter Bayesian Regret Bounds. (arXiv:2311.03760v1 [cs.LG])
    Among various acquisition functions (AFs) in Bayesian optimization (BO), Gaussian process upper confidence bound (GP-UCB) and Thompson sampling (TS) are well-known options with established theoretical properties regarding Bayesian cumulative regret (BCR). Recently, it has been shown that a randomized variant of GP-UCB achieves a tighter BCR bound compared with GP-UCB, which we call the tighter BCR bound for brevity. Inspired by this study, this paper first shows that TS achieves the tighter BCR bound. On the other hand, GP-UCB and TS often practically suffer from manual hyperparameter tuning and over-exploration issues, respectively. To overcome these difficulties, we propose yet another AF called a probability of improvement from the maximum of a sample path (PIMS). We show that PIMS achieves the tighter BCR bound and avoids the hyperparameter tuning, unlike GP-UCB. Furthermore, we demonstrate a wide range of experiments, focusing on the effectiveness of PIMS that mitigates the practical issues of GP-UCB and TS.  ( 2 min )
    Bandit Pareto Set Identification: the Fixed Budget Setting. (arXiv:2311.03992v1 [stat.ML])
    We study a multi-objective pure exploration problem in a multi-armed bandit model. Each arm is associated with an unknown multi-variate distribution and the goal is to identify the distributions whose mean is not uniformly worse than that of another distribution: the Pareto optimal set. We propose and analyze the first algorithms for the fixed budget Pareto Set Identification task. We propose Empirical Gap Elimination, a family of algorithms combining a careful estimation of the "hardness to classify" each arm in or out of the Pareto set with a generic elimination scheme. We prove that two particular instances, EGE-SR and EGE-SH, have a probability of error that decays exponentially fast with the budget, with an exponent supported by an information theoretic lower-bound. We complement these findings with an empirical study using real-world and synthetic datasets, which showcases the good performance of our algorithms.  ( 2 min )
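    The target of the identification task, the Pareto optimal set of arm means, can be illustrated with a minimal sketch. It assumes that larger is better on every objective and uses the standard dominance relation (at least as good everywhere, strictly better somewhere); the function names are hypothetical, and this is not the EGE algorithm from the abstract, only the set it tries to recover from samples.

```python
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """Return True if mean vector a is at least as good as b on every
    objective and strictly better on at least one (assumed convention)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_set(means: List[Sequence[float]]) -> List[int]:
    """Indices of arms whose mean vector is not dominated by any other arm."""
    return [
        i
        for i, m in enumerate(means)
        if not any(dominates(other, m) for j, other in enumerate(means) if j != i)
    ]


if __name__ == "__main__":
    # Four arms, two objectives; the last arm is dominated by all the others.
    arm_means = [[1.0, 3.0], [3.0, 1.0], [2.0, 2.0], [1.0, 1.0]]
    print(pareto_set(arm_means))
```

    In the bandit setting the means are unknown, so a fixed-budget algorithm would apply a rule like this to empirical means estimated from a limited number of pulls.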
    TWIST: Teacher-Student World Model Distillation for Efficient Sim-to-Real Transfer. (arXiv:2311.03622v1 [cs.RO])
    Model-based RL is a promising approach for real-world robotics due to its improved sample efficiency and generalization capabilities compared to model-free RL. However, effective model-based RL solutions for vision-based real-world applications require bridging the sim-to-real gap for any world model learnt. Due to its significant computational cost, standard domain randomisation does not provide an effective solution to this problem. This paper proposes TWIST (Teacher-Student World Model Distillation for Sim-to-Real Transfer) to achieve efficient sim-to-real transfer of vision-based model-based RL using distillation. Specifically, TWIST leverages state observations as readily accessible, privileged information commonly garnered from a simulator to significantly accelerate sim-to-real transfer. A teacher world model is trained efficiently on state information. At the same time, a matching dataset of domain-randomised image observations is collected. The teacher world model then supervises a student world model that takes the domain-randomised image observations as input. By distilling the learned latent dynamics model from the teacher to the student model, TWIST achieves efficient and effective sim-to-real transfer for vision-based model-based RL tasks. Experiments in simulated and real robotics tasks demonstrate that our approach outperforms naive domain randomisation and model-free methods in terms of sample efficiency and task performance of sim-to-real transfer.  ( 2 min )
    Neural Rankers for Code Generation via Inter-Cluster Modeling. (arXiv:2311.03366v1 [cs.SE])
    Code Large Language Models (CodeLLMs) have ushered in a new era of code generation advancements. However, selecting the best solutions from among all possible CodeLLM solutions remains a challenge. Previous methods frequently overlooked the intricate functional similarities and interactions between clusters, resulting in suboptimal results. In this work, we introduce SRank, a novel reranking strategy for selecting the best solution from code generation that focuses on modeling inter-cluster relationships. By quantifying the functional overlap between clusters, our approach provides a better ranking strategy for code solutions. Empirical results show that our method achieves remarkable results on the pass@1 score. For instance, on the Human-Eval benchmark, we achieve 69.66% in pass@1 with Codex002, 75.31% for WizardCoder, 53.99% for StarCoder and 60.55% for CodeGen, which surpass state-of-the-art solution ranking methods, such as CodeT and Coder-Reviewer, on the same CodeLLM by a significant margin (approximately 6.1% improvement on average). Compared to the random sampling method, we achieve an average improvement of approximately 23.07% on Human-Eval and 17.64% on MBPP. Even in scenarios with limited test inputs, our approach demonstrates robustness and superiority, marking a new state of the art in code generation reranking.
    Learning to Learn for Few-shot Continual Active Learning. (arXiv:2311.03732v1 [cs.LG])
    Continual learning (CL) strives to ensure stability in solving previously seen tasks while demonstrating plasticity in a novel domain. Recent advances in CL are mostly confined to a supervised learning setting, especially in the NLP domain. In this work, we consider a few-shot continual active learning (CAL) setting where labeled data is inadequate, and unlabeled data is abundant but with a limited annotation budget. We propose a simple but efficient method, called Meta-Continual Active Learning. Specifically, we employ meta-learning and experience replay to address the trade-off between stability and plasticity. As a result, it finds an optimal initialization that efficiently utilizes annotated information for fast adaptation while preventing catastrophic forgetting of past tasks. We conduct extensive experiments to validate the effectiveness of the proposed method and analyze the effect of various active learning strategies and memory sample selection methods in a few-shot CAL setup. Our experimental results demonstrate that random sampling is the best default strategy for both active learning and memory sample selection to solve few-shot CAL problems.
    Augmenting Radio Signals with Wavelet Transform for Deep Learning-Based Modulation Recognition. (arXiv:2311.03761v1 [cs.LG])
    The use of deep learning for radio modulation recognition has become prevalent in recent years. This approach automatically extracts high-dimensional features from large datasets, facilitating the accurate classification of modulation schemes. However, in real-world scenarios, it may not be feasible to gather sufficient training data in advance. Data augmentation is a method used to increase the diversity and quantity of the training dataset and to reduce data sparsity and imbalance. In this paper, we propose data augmentation methods that replace the detail coefficients obtained from a discrete wavelet transform decomposition and then reconstruct the signal, generating new samples that expand the training set. Different generation methods are used to produce the replacement sequences. Simulation results indicate that our proposed methods significantly outperform the other augmentation methods.
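    The augmentation idea (decompose, replace the detail coefficients, reconstruct) can be sketched in a few lines. Everything specific here is an assumption made for illustration: a single-level Haar wavelet, a Gaussian replacement sequence matched to the mean and spread of the original detail coefficients, and the function names. The abstract does not say which wavelet or which generation methods the authors use.

```python
import math
import random
from typing import List, Tuple

SQRT2 = math.sqrt(2.0)


def haar_dwt(signal: List[float]) -> Tuple[List[float], List[float]]:
    """Single-level Haar DWT: returns (approximation, detail) coefficients."""
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    approx = [(signal[i] + signal[i + 1]) / SQRT2 for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / SQRT2 for i in range(0, len(signal), 2)]
    return approx, detail


def haar_idwt(approx: List[float], detail: List[float]) -> List[float]:
    """Inverse of haar_dwt: rebuilds the signal from both coefficient sets."""
    signal: List[float] = []
    for a, d in zip(approx, detail):
        signal.append((a + d) / SQRT2)
        signal.append((a - d) / SQRT2)
    return signal


def augment_with_replaced_detail(signal: List[float], rng: random.Random) -> List[float]:
    """Keep the approximation coefficients, swap in a new detail sequence
    (hypothetical choice: Gaussian noise with matching mean and std), reconstruct."""
    approx, detail = haar_dwt(signal)
    mean = sum(detail) / len(detail)
    std = math.sqrt(sum((d - mean) ** 2 for d in detail) / len(detail))
    new_detail = [rng.gauss(mean, std) for _ in detail]
    return haar_idwt(approx, new_detail)


if __name__ == "__main__":
    sample = [1.0, 2.0, 3.0, 5.0, 8.0, 13.0, 21.0, 34.0]
    print(augment_with_replaced_detail(sample, random.Random(0)))
```

    Because the approximation coefficients are untouched, the new sample keeps the coarse shape of the original signal while its fine-scale content changes.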
    Inexact bilevel stochastic gradient methods for constrained and unconstrained lower-level problems. (arXiv:2110.00604v3 [math.OC] UPDATED)
    Two-level stochastic optimization formulations have become instrumental in a number of machine learning contexts such as continual learning, neural architecture search, adversarial learning, and hyperparameter tuning. Practical stochastic bilevel optimization problems become challenging in optimization or learning scenarios where the number of variables is high or there are constraints. In this paper, we introduce a bilevel stochastic gradient method for bilevel problems with nonlinear and possibly nonconvex lower-level constraints. We also present a comprehensive convergence theory that addresses both the lower-level unconstrained and constrained cases and covers all inexact calculations of the adjoint gradient (also called hypergradient), such as the inexact solution of the lower-level problem, inexact computation of the adjoint formula (due to the inexact solution of the adjoint equation or use of a truncated Neumann series), and noisy estimates of the gradients, Hessians, and Jacobians involved. To promote the use of bilevel optimization in large-scale learning, we have developed new low-rank practical bilevel stochastic gradient methods (BSG-N-FD and BSG-1) that do not require second-order derivatives and, in the lower-level unconstrained case, dismiss any matrix-vector products.  ( 2 min )
    Geodesic Multi-Modal Mixup for Robust Fine-Tuning. (arXiv:2203.03897v4 [cs.CV] UPDATED)
    Pre-trained multi-modal models, such as CLIP, provide transferable embeddings and show promising results in diverse applications. However, the analysis of learned multi-modal embeddings is relatively unexplored, and the embedding transferability can be improved. In this work, we observe that CLIP holds separated embedding subspaces for two different modalities, and then we investigate it through the lens of uniformity-alignment to measure the quality of learned representation. Both theoretically and empirically, we show that CLIP retains poor uniformity and alignment even after fine-tuning. Such a lack of alignment and uniformity might restrict the transferability and robustness of embeddings. To this end, we devise a new fine-tuning method for robust representation equipping better alignment and uniformity. First, we propose a Geodesic Multi-Modal Mixup that mixes the embeddings of image and text to generate hard negative samples on the hypersphere. Then, we fine-tune the model on hard negatives as well as original negatives and positives with contrastive loss. Based on the theoretical analysis about hardness guarantee and limiting behavior, we justify the use of our method. Extensive experiments on retrieval, calibration, few- or zero-shot classification (under distribution shift), embedding arithmetic, and image captioning further show that our method provides transferable representations, enabling robust model adaptation on diverse tasks. Code: https://github.com/changdaeoh/multimodal-mixup  ( 3 min )
    FD-MIA: Efficient Attacks on Fairness-enhanced Models. (arXiv:2311.03865v1 [cs.LG])
    Previous studies have developed fairness methods for biased models that exhibit discriminatory behaviors towards specific subgroups. While these models have shown promise in achieving fair predictions, recent research has identified their potential vulnerability to score-based membership inference attacks (MIAs). In these attacks, adversaries can infer whether a particular data sample was used during training by analyzing the model's prediction scores. However, our investigations reveal that these score-based MIAs are ineffective when targeting fairness-enhanced models in binary classifications. The attack models trained to launch the MIAs degrade into simplistic threshold models, resulting in lower attack performance. Meanwhile, we observe that fairness methods often lead to prediction performance degradation for the majority subgroups of the training data. This raises the barrier to successful attacks and widens the prediction gaps between member and non-member data. Building upon these insights, we propose an efficient MIA method against fairness-enhanced models based on fairness discrepancy results (FD-MIA). It leverages the difference in the predictions from both the original and fairness-enhanced models and exploits the observed prediction gaps as attack clues. We also explore potential strategies for mitigating privacy leakages. Extensive experiments validate our findings and demonstrate the efficacy of the proposed method.  ( 2 min )
    PT-Tuning: Bridging the Gap between Time Series Masked Reconstruction and Forecasting via Prompt Token Tuning. (arXiv:2311.03768v1 [cs.LG])
    Self-supervised learning has been actively studied in the time series domain recently, especially for masked reconstruction. Most of these methods follow the "Pre-training + Fine-tuning" paradigm in which a new decoder replaces the pre-trained decoder to fit a specific downstream task, leading to inconsistency between upstream and downstream tasks. In this paper, we first point out that the unification of task objectives and adaptation for task difficulty are critical for bridging the gap between time series masked reconstruction and forecasting. By reserving the pre-trained mask token during the fine-tuning stage, the forecasting task can be taken as a special case of masked reconstruction, where the future values are masked and reconstructed based on history values. This guarantees the consistency of task objectives, but there is still a gap in task difficulty, because masked reconstruction can utilize contextual information while forecasting can only use historical information to reconstruct. To further mitigate the existing gap, we propose a simple yet effective prompt token tuning (PT-Tuning) paradigm, in which all pre-trained parameters are frozen and only a few trainable prompt tokens are added to extended mask tokens in an element-wise manner. Extensive experiments on real-world datasets demonstrate the superiority of our proposed paradigm with state-of-the-art performance compared to representation learning and end-to-end supervised forecasting methods.  ( 3 min )
    Aspects of human memory and Large Language Models. (arXiv:2311.03839v1 [cs.CL])
    Large Language Models (LLMs) are huge artificial neural networks which primarily serve to generate text, but also provide a very sophisticated probabilistic model of language use. Since generating a semantically consistent text requires a form of effective memory, we investigate the memory properties of LLMs and find surprising similarities with key characteristics of human memory. This result strongly suggests that the biological features of human memory leave an imprint on the way that we structure our textual narratives.  ( 2 min )
    Cup Curriculum: Curriculum Learning on Model Capacity. (arXiv:2311.03956v1 [cs.LG])
    Curriculum learning (CL) aims to increase the performance of a learner on a given task by applying a specialized learning strategy. This strategy focuses on either the dataset, the task, or the model. There is little to no work analysing the possibilities of applying CL to the model capacity in natural language processing. To close this gap, we propose the cup curriculum. In a first phase of training we use a variation of iterative magnitude pruning to reduce model capacity. The pruned weights are reintroduced in a second phase, so that the model capacity follows a cup-shaped curve over the training iterations. We empirically evaluate different strategies of the cup curriculum and show that it outperforms early stopping reliably while exhibiting a high resilience to overfitting.  ( 2 min )
    User-level Differentially Private Stochastic Convex Optimization: Efficient Algorithms with Optimal Rates. (arXiv:2311.03797v1 [cs.LG])
    We study differentially private stochastic convex optimization (DP-SCO) under user-level privacy, where each user may hold multiple data items. Existing work for user-level DP-SCO either requires super-polynomial runtime [Ghazi et al. (2023)] or requires the number of users to grow polynomially with the dimensionality of the problem with additional strict assumptions [Bassily et al. (2023)]. We develop new algorithms for user-level DP-SCO that obtain optimal rates for both convex and strongly convex functions in polynomial time and require the number of users to grow only logarithmically in the dimension. Moreover, our algorithms are the first to obtain optimal rates for non-smooth functions in polynomial time. These algorithms are based on multiple-pass DP-SGD, combined with a novel private mean estimation procedure for concentrated data, which applies an outlier removal step before estimating the mean of the gradients.  ( 2 min )
    Unsupervised Video Summarization. (arXiv:2311.03745v1 [cs.CV])
    This paper introduces a new, unsupervised method for automatic video summarization using ideas from generative adversarial networks but eliminating the discriminator, having a simple loss function, and separating training of different parts of the model. An iterative training strategy is also applied by alternately training the reconstructor and the frame selector for multiple iterations. Furthermore, a trainable mask vector is added to the model in summary generation during training and evaluation. The method also includes an unsupervised model selection algorithm. Results from experiments on two public datasets (SumMe and TVSum) and four datasets we created (Soccer, LoL, MLB, and ShortMLB) demonstrate the effectiveness of each component on the model performance, particularly the iterative training strategy. Evaluations and comparisons with the state-of-the-art methods highlight the advantages of the proposed method in performance, stability, and training efficiency.  ( 2 min )
    Temporal Graph Representation Learning with Adaptive Augmentation Contrastive. (arXiv:2311.03897v1 [cs.LG])
    Temporal graph representation learning aims to generate low-dimensional dynamic node embeddings to capture temporal information as well as structural and property information. Current representation learning methods for temporal networks often focus on capturing fine-grained information, which may lead to the model capturing random noise instead of essential semantic information. While graph contrastive learning has shown promise in dealing with noise, it only applies to static graphs or snapshots and may not be suitable for handling time-dependent noise. To alleviate the above challenge, we propose a novel Temporal Graph representation learning with Adaptive augmentation Contrastive (TGAC) model. The adaptive augmentation on the temporal graph is made by combining prior knowledge with temporal information, and the contrastive objective function is constructed by defining the augmented inter-view contrast and intra-view contrast. To complement TGAC, we propose three adaptive augmentation strategies that modify topological features to reduce noise from the network. Our extensive experiments on various real networks demonstrate that the proposed model outperforms other temporal graph representation learning methods.  ( 2 min )
    The NeurIPS 2022 Neural MMO Challenge: A Massively Multiagent Competition with Specialization and Trade. (arXiv:2311.03707v1 [cs.AI])
    In this paper, we present the results of the NeurIPS-2022 Neural MMO Challenge, which attracted 500 participants and received over 1,600 submissions. Like the previous IJCAI-2022 Neural MMO Challenge, it involved agents from 16 populations surviving in procedurally generated worlds by collecting resources and defeating opponents. This year's competition runs on the latest v1.6 Neural MMO, which introduces new equipment, combat, trading, and a better scoring system. These elements combine to pose additional robustness and generalization challenges not present in previous competitions. This paper summarizes the design and results of the challenge, explores the potential of this environment as a benchmark for learning methods, and presents some practical reinforcement learning training approaches for complex tasks with sparse rewards. Additionally, we have open-sourced our baselines, including environment wrappers, benchmarks, and visualization tools for future research.  ( 2 min )
    Structural Causal Models Reveal Confounder Bias in Linear Program Modelling. (arXiv:2105.12697v6 [cs.LG] UPDATED)
    Recent years have been marked by extended research on adversarial attacks, especially on deep neural networks. With this work we intend to pose and investigate the question of whether the phenomenon might be more general in nature, that is, adversarial-style attacks outside classical classification tasks. Specifically, we investigate optimization problems as they constitute a fundamental part of modern AI research. To this end, we consider the base class of optimizers, namely Linear Programs (LPs). On our initial attempt at a naïve mapping between the formalism of adversarial examples and LPs, we quickly identify the key ingredients missing for making sense of a reasonable notion of adversarial examples for LPs. Intriguingly, the formalism of Pearl's notion of causality allows for the right description of adversarial-like examples for LPs. Characteristically, we show the direct influence of the Structural Causal Model (SCM) on the subsequent LP optimization, which ultimately exposes a notion of confounding in LPs (inherited from said SCM) that allows for adversarial-style attacks. We provide both the general proof formally alongside existential proofs of such intriguing LP-parameterizations based on SCM for three combinatorial problems, namely Linear Assignment, Shortest Path and a real world problem of energy systems.
    Non-Asymptotic Analysis of a UCB-based Top Two Algorithm. (arXiv:2210.05431v3 [stat.ML] UPDATED)
    A Top Two sampling rule for bandit identification is a method which selects the next arm to sample from among two candidate arms, a leader and a challenger. Due to their simplicity and good empirical performance, Top Two methods have received increased attention in recent years. However, for fixed-confidence best arm identification, theoretical guarantees for Top Two methods have only been obtained in the asymptotic regime, when the error level vanishes. In this paper, we derive the first non-asymptotic upper bound on the expected sample complexity of a Top Two algorithm, which holds for any error level. Our analysis highlights sufficient properties for a regret minimization algorithm to be used as leader. These properties are satisfied by the UCB algorithm, and our proposed UCB-based Top Two algorithm simultaneously enjoys non-asymptotic guarantees and competitive empirical performance.  ( 2 min )
    Preventing Arbitrarily High Confidence on Far-Away Data in Point-Estimated Discriminative Neural Networks. (arXiv:2311.03683v1 [cs.LG])
    Discriminatively trained, deterministic neural networks are the de facto choice for classification problems. However, even though they achieve state-of-the-art results on in-domain test sets, they tend to be overconfident on out-of-distribution (OOD) data. For instance, ReLU networks -- a popular class of neural network architectures -- have been shown to almost always yield high confidence predictions when the test data are far away from the training set, even when they are trained with OOD data. We overcome this problem by adding a term to the output of the neural network that corresponds to the logit of an extra class, which we design to dominate the logits of the original classes as we move away from the training data. This technique provably prevents arbitrarily high confidence on far-away test data while maintaining a simple discriminative point-estimate training. Evaluation on various benchmarks demonstrates strong performance against competitive baselines on both far-away and realistic OOD data.  ( 2 min )
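    The extra-class idea can be illustrated with a toy sketch. All specifics are assumptions chosen for illustration rather than the construction from the paper: a linear classifier stands in for the network, and the extra logit is the scaled squared distance to the nearest training point, which grows quadratically and therefore eventually dominates logits that grow only linearly with the input.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear_logits(x, weights, biases) -> List[float]:
    """Stand-in for a network: one linear logit per original class."""
    return [sum(w_i * x_i for w_i, x_i in zip(w, x)) + b for w, b in zip(weights, biases)]


def extra_class_logit(x, train_points, scale: float = 1.0) -> float:
    """Hypothetical extra logit: scaled squared distance to the nearest training point."""
    return scale * min(sum((a - b) ** 2 for a, b in zip(x, p)) for p in train_points)


def predict_with_extra_class(x, weights, biases, train_points, scale: float = 1.0) -> List[float]:
    """Softmax over the original classes plus the appended extra class (last entry)."""
    logits = linear_logits(x, weights, biases) + [extra_class_logit(x, train_points, scale)]
    return softmax(logits)


if __name__ == "__main__":
    weights, biases = [[1.0, 0.0], [-1.0, 0.0]], [0.0, 0.0]
    train = [[1.0, 0.0], [-1.0, 0.0]]
    print(predict_with_extra_class([1.0, 0.0], weights, biases, train))    # on a training point
    print(predict_with_extra_class([100.0, 0.0], weights, biases, train))  # far from the data
```

    Near the training data the extra logit is small and the original classes decide the prediction; far away, nearly all probability mass moves to the extra class, so confidence in any original class stays low.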
    Federated Learning for Clinical Structured Data: A Benchmark Comparison of Engineering and Statistical Approaches. (arXiv:2311.03417v1 [cs.LG])
    Federated learning (FL) has shown promising potential in safeguarding data privacy in healthcare collaborations. While the term "FL" was originally coined by the engineering community, the statistical field has also explored similar privacy-preserving algorithms. Statistical FL algorithms, however, remain considerably less recognized than their engineering counterparts. Our goal was to bridge the gap by presenting the first comprehensive comparison of FL frameworks from both engineering and statistical domains. We evaluated five FL frameworks using both simulated and real-world data. The results indicate that statistical FL algorithms yield less biased point estimates for model coefficients and offer convenient confidence interval estimations. In contrast, engineering-based methods tend to generate more accurate predictions, sometimes surpassing central pooled and statistical FL models. This study underscores the relative strengths and weaknesses of both types of methods, emphasizing the need for increased awareness and their integration in future FL applications.  ( 2 min )
    Brain Networks and Intelligence: A Graph Neural Network Based Approach to Resting State fMRI Data. (arXiv:2311.03520v1 [cs.LG])
    Resting-state functional magnetic resonance imaging (rsfMRI) is a powerful tool for investigating the relationship between brain function and cognitive processes as it allows for the functional organization of the brain to be captured without relying on a specific task or stimuli. In this paper, we present a novel modeling architecture called BrainRGIN for predicting intelligence (fluid, crystallized, and total intelligence) using graph neural networks on rsfMRI-derived static functional network connectivity matrices. Extending the existing graph convolution networks, our approach incorporates a clustering-based embedding and graph isomorphism network in the graph convolutional layer to reflect the nature of the brain sub-network organization and efficient network expression, in combination with TopK pooling and attention-based readout functions. We evaluated our proposed architecture on a large dataset, specifically the Adolescent Brain Cognitive Development Dataset, and demonstrated its effectiveness in predicting individual differences in intelligence. Our model achieved lower mean squared errors and higher correlation scores than existing relevant graph architectures and other traditional machine learning models for all of the intelligence prediction tasks. The middle frontal gyrus exhibited a significant contribution to both fluid and crystallized intelligence, suggesting its pivotal role in these cognitive processes. Total composite scores identified a diverse set of brain regions as relevant, which underscores the complex nature of total intelligence.  ( 3 min )
    Enhanced physics-informed neural networks with domain scaling and residual correction methods for multi-frequency elliptic problems. (arXiv:2311.03746v1 [math.NA])
    In this paper, neural network approximation methods are developed for elliptic partial differential equations with multi-frequency solutions. Neural network approximation methods have advantages over classical approaches in that they can be applied without much concern for the form of the differential equations or the shape or dimension of the problem domain. When applied to problems with multi-frequency solutions, the performance and accuracy of neural network approximation methods are strongly affected by the contrast between the high- and low-frequency parts of the solutions. To address this issue, domain scaling and residual correction methods are proposed. The efficiency and accuracy of the proposed methods are demonstrated for multi-frequency model problems.  ( 2 min )
    A Novel Variational Lower Bound for Inverse Reinforcement Learning. (arXiv:2311.03698v1 [cs.LG])
    Inverse reinforcement learning (IRL) seeks to learn the reward function from expert trajectories, to understand the task for imitation or collaboration thereby removing the need for manual reward engineering. However, IRL in the context of large, high-dimensional problems with unknown dynamics has been particularly challenging. In this paper, we present a new Variational Lower Bound for IRL (VLB-IRL), which is derived under the framework of a probabilistic graphical model with an optimality node. Our method simultaneously learns the reward function and policy under the learned reward function by maximizing the lower bound, which is equivalent to minimizing the reverse Kullback-Leibler divergence between an approximated distribution of optimality given the reward function and the true distribution of optimality given trajectories. This leads to a new IRL method that learns a valid reward function such that the policy under the learned reward achieves expert-level performance on several known domains. Importantly, the method outperforms the existing state-of-the-art IRL algorithms on these domains by demonstrating better reward from the learned policy.  ( 2 min )
    Enhancing AI Research Paper Analysis: Methodology Component Extraction using Factored Transformer-based Sequence Modeling Approach. (arXiv:2311.03401v1 [cs.IR])
    Research in scientific disciplines evolves, often rapidly, over time with the emergence of novel methodologies and their associated terminologies. While methodologies themselves are conceptual in nature and rather difficult to automatically extract and characterise, in this paper we seek to develop supervised models for automatic extraction of the names of the various constituents of a methodology, e.g., 'R-CNN', 'ELMo' etc. The main research challenge for this task is effectively modeling the contexts around these methodology component names in a few-shot or even a zero-shot setting. The main contributions of this paper towards effectively identifying new evolving scientific methodology names are as follows: i) we propose a factored approach to sequence modeling, which leverages broad-level category information of methodology domains, e.g., 'NLP', 'RL' etc.; ii) to demonstrate the feasibility of our proposed approach of identifying methodology component names under a practical setting of fast evolving AI literature, we conduct experiments following a simulated chronological setup (newer methodologies not seen during the training process); iii) our experiments demonstrate that the factored approach outperforms state-of-the-art baselines by margins of up to 9.257% for the methodology extraction task with the few-shot setup.  ( 2 min )
    Multimodal deep representation learning for quantum cross-platform verification. (arXiv:2311.03713v1 [quant-ph])
    Cross-platform verification, a critical undertaking in the realm of early-stage quantum computing, endeavors to characterize the similarity of two imperfect quantum devices executing identical algorithms, utilizing minimal measurements. While the random measurement approach has been instrumental in this context, its quasi-exponential computational demand with increasing qubit count hinders its feasibility in large-qubit scenarios. To bridge this knowledge gap, here we introduce an innovative multimodal learning approach, recognizing that the formalism of data in this task embodies two distinct modalities: measurement outcomes and classical description of compiled circuits on explored quantum devices, both enriched with unique information. Building upon this insight, we devise a multimodal neural network to independently extract knowledge from these modalities, followed by a fusion operation to create a comprehensive data representation. The learned representation can effectively characterize the similarity between the explored quantum devices when executing new quantum algorithms not present in the training data. We evaluate our proposal on platforms featuring diverse noise models, encompassing system sizes up to 50 qubits. The achieved results demonstrate a three-orders-of-magnitude improvement in prediction accuracy compared to the random measurements and offer compelling evidence of the complementary roles played by each modality in cross-platform verification. These findings pave the way for harnessing the power of multimodal learning to overcome challenges in wider quantum system learning tasks.  ( 2 min )
    GQKVA: Efficient Pre-training of Transformers by Grouping Queries, Keys, and Values. (arXiv:2311.03426v1 [cs.LG])
    Massive transformer-based models face several challenges, including slow and computationally intensive pre-training and over-parametrization. This paper addresses these challenges by proposing a versatile method called GQKVA, which generalizes query, key, and value grouping techniques. GQKVA is designed to speed up transformer pre-training while reducing the model size. Our experiments with various GQKVA variants highlight a clear trade-off between performance and model size, allowing for customized choices based on resource and time limitations. Our findings also indicate that the conventional multi-head attention approach is not always the best choice, as there are lighter and faster alternatives available. We tested our method on ViT, which achieved an approximate 0.3% increase in accuracy while reducing the model size by about 4% in the task of image classification. Additionally, our most aggressive model reduction experiment resulted in a reduction of approximately 15% in model size, with only around a 1% drop in accuracy.  ( 2 min )
    Leveraging High-Level Synthesis and Large Language Models to Generate, Simulate, and Deploy a Uniform Random Number Generator Hardware Design. (arXiv:2311.03489v1 [cs.AR])
    We present a new high-level synthesis methodology for using large language model tools to generate hardware designs. The methodology uses exclusively open-source tools excluding the large language model. As a case study, we use our methodology to generate a permuted congruential random number generator design with a wishbone interface. We verify the functionality and quality of the random number generator design using large language model-generated simulations and the Dieharder randomness test suite. We document all the large language model chat logs, Python scripts, Verilog scripts, and simulation results used in the case study. We believe that our method of hardware design generation coupled with the open source silicon 130 nm design tools will revolutionize application-specific integrated circuit design. Our methodology significantly lowers the bar to entry when building domain-specific computing accelerators for the Internet of Things and proof of concept prototypes for later fabrication in more modern process nodes.  ( 2 min )
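    For readers unfamiliar with the generator family named above, the following is a minimal software sketch of one common permuted congruential variant (PCG-XSH-RR with 64-bit state and 32-bit output). It is an assumption that this variant resembles the one in the paper; the abstract does not say which variant was generated, and the class name `PCG32` and its parameters are illustrative only. It says nothing about the paper's Verilog design or Wishbone interface.

```python
MASK64 = (1 << 64) - 1
MASK32 = (1 << 32) - 1
MULTIPLIER = 6364136223846793005  # 64-bit LCG multiplier used by the PCG reference code


class PCG32:
    """Illustrative PCG-XSH-RR 64/32 generator (software model, not the paper's hardware)."""

    def __init__(self, seed, stream=0):
        # The increment must be odd; the stream id selects one of 2^63 sequences.
        self.inc = ((stream << 1) | 1) & MASK64
        self.state = 0
        self.next_u32()
        self.state = (self.state + seed) & MASK64
        self.next_u32()

    def next_u32(self):
        old = self.state
        # Linear congruential state update, truncated to 64 bits.
        self.state = (old * MULTIPLIER + self.inc) & MASK64
        # Output permutation: xorshift the high bits, then a data-dependent rotate.
        xorshifted = (((old >> 18) ^ old) >> 27) & MASK32
        rot = old >> 59
        return ((xorshifted >> rot) | (xorshifted << ((-rot) & 31))) & MASK32
```

    A stream of such 32-bit words could be written to a file and fed to a statistical test suite such as Dieharder, which is the kind of quality check the abstract describes.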
    Testing RadiX-Nets: Advances in Viable Sparse Topologies. (arXiv:2311.03609v1 [cs.LG])
    The exponential growth of data has sparked computational demands on ML research and industry use. Sparsification of hyper-parametrized deep neural networks (DNNs) creates simpler representations of complex data. Past research has shown that some sparse networks achieve similar performance as dense ones, reducing runtime and storage. RadiX-Nets, a subgroup of sparse DNNs, maintain uniformity which counteracts their lack of neural connections. Generation, independent of a dense network, yields faster asymptotic training and removes the need for costly pruning. However, little work has been done on RadiX-Nets, making testing challenging. This paper presents a testing suite for RadiX-Nets in TensorFlow. We test RadiX-Net performance to streamline processing in scalable models, revealing relationships between network topology, initialization, and training behavior. We also encounter "strange models" that train inconsistently and to lower accuracy while models of similar sparsity train well.  ( 2 min )
    Communication Efficient and Privacy-Preserving Federated Learning Based on Evolution Strategies. (arXiv:2311.03405v1 [cs.LG])
    Federated learning (FL) is an emerging paradigm for training deep neural networks (DNNs) in distributed manners. Current FL approaches all suffer from high communication overhead and information leakage. In this work, we present a federated learning algorithm based on evolution strategies (FedES), a zeroth-order training method. Instead of transmitting model parameters, FedES only communicates loss values, and thus has very low communication overhead. Moreover, a third party is unable to estimate gradients without knowing the pre-shared seed, which protects data privacy. Experimental results demonstrate FedES can achieve the above benefits while keeping convergence performance the same as that with back propagation methods.  ( 2 min )
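    The idea of communicating only loss values can be illustrated with a generic evolution-strategies gradient estimate. This is a minimal single-client sketch under stated assumptions, not the FedES algorithm itself: the function names, the Gaussian perturbations, the baseline subtraction, and the toy quadratic loss are all hypothetical choices made for illustration. The client and server regenerate identical perturbation directions from a shared seed, so only scalar losses cross the network.

```python
import random


def perturbations(seed, n_samples, dim):
    """Regenerate the same Gaussian directions on both sides from a shared seed."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_samples)]


def client_losses(theta, loss_fn, seed, n_samples, sigma):
    """Client side: evaluate the local loss at perturbed parameters and return only scalars."""
    return [
        loss_fn([t + sigma * e for t, e in zip(theta, eps)])
        for eps in perturbations(seed, n_samples, len(theta))
    ]


def server_step(theta, losses, seed, sigma, lr):
    """Server side: rebuild the directions from the seed and take a zeroth-order gradient step."""
    eps_list = perturbations(seed, len(losses), len(theta))
    baseline = sum(losses) / len(losses)  # subtracting the mean loss reduces variance
    grad = [
        sum((loss - baseline) * eps[j] for loss, eps in zip(losses, eps_list))
        / (len(losses) * sigma)
        for j in range(len(theta))
    ]
    return [t - lr * g for t, g in zip(theta, grad)]


def quadratic(x):
    """Toy local loss used only for this illustration."""
    return sum(v * v for v in x)
```

    In this sketch an eavesdropper who sees the loss values but not the seed cannot reconstruct the directions, which mirrors the privacy argument in the abstract.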
    A Generative Neural Network Approach for 3D Multi-Criteria Design Generation and Optimization of an Engine Mount for an Unmanned Air Vehicle. (arXiv:2311.03414v1 [cs.LG])
    One of the most promising developments in computer vision in recent years is the use of generative neural networks for functionality condition-based 3D design reconstruction and generation. Here, neural networks learn dependencies between functionalities and a geometry in a very effective way. For a neural network, the functionalities are translated into conditions for a certain geometry. But the more conditions the design generation needs to reflect, the more difficult it is to learn clear dependencies. This leads to a multi-criteria design problem due to various conditions, which are not considered in the neural network structure so far. In this paper, we address this multi-criteria challenge for a 3D design use case related to an unmanned aerial vehicle (UAV) motor mount. We generate 10,000 abstract 3D designs and subject them all to simulations for three physical disciplines: mechanics, thermodynamics, and aerodynamics. Then, we train a Conditional Variational Autoencoder (CVAE) using the geometry and corresponding multi-criteria functional constraints as input. We use our trained CVAE as well as the Marching cubes algorithm to generate meshes for simulation-based evaluation. The results are then evaluated with the generated UAV designs. Subsequently, we demonstrate the ability to generate optimized designs under self-defined functionality conditions using the trained neural network.  ( 3 min )
    Blocked Collaborative Bandits: Online Collaborative Filtering with Per-Item Budget Constraints. (arXiv:2311.03376v1 [cs.IR])
    We consider the problem of \emph{blocked} collaborative bandits where there are multiple users, each with an associated multi-armed bandit problem. These users are grouped into \emph{latent} clusters such that the mean reward vectors of users within the same cluster are identical. Our goal is to design algorithms that maximize the cumulative reward accrued by all the users over time, under the \emph{constraint} that no arm of a user is pulled more than $\mathsf{B}$ times. This problem was originally considered by \cite{Bresler:2014}, and designing regret-optimal algorithms for it has since remained an open problem. In this work, we propose an algorithm called \texttt{B-LATTICE} (Blocked Latent bAndiTs via maTrIx ComplEtion) that collaborates across users, while simultaneously satisfying the budget constraints, to maximize their cumulative rewards. Theoretically, under certain reasonable assumptions on the latent structure, with $\mathsf{M}$ users, $\mathsf{N}$ arms, $\mathsf{T}$ rounds per user, and $\mathsf{C}=O(1)$ latent clusters, \texttt{B-LATTICE} achieves a per-user regret of $\widetilde{O}(\sqrt{\mathsf{T}(1 + \mathsf{N}\mathsf{M}^{-1})})$ under a budget constraint of $\mathsf{B}=\Theta(\log \mathsf{T})$. These are the first sub-linear regret bounds for this problem, and match the minimax regret bounds when $\mathsf{B}=\mathsf{T}$. Empirically, we demonstrate that our algorithm has superior performance over baselines even when $\mathsf{B}=1$. \texttt{B-LATTICE} runs in phases where in each phase it clusters users into groups and collaborates across users within a group to quickly learn their reward models.  ( 2 min )
    Multi-Resolution Diffusion for Privacy-Sensitive Recommender Systems. (arXiv:2311.03488v1 [cs.IR])
    While recommender systems have become an integral component of the Web experience, their heavy reliance on user data raises privacy and security concerns. Substituting user data with synthetic data can address these concerns, but accurately replicating these real-world datasets has been a notoriously challenging problem. Recent advancements in generative AI have demonstrated the impressive capabilities of diffusion models in generating realistic data across various domains. In this work we introduce a Score-based Diffusion Recommendation Model (SDRM), which captures the intricate patterns of real-world datasets required for training highly accurate recommender systems. SDRM allows for the generation of synthetic data that can replace existing datasets to preserve user privacy, or augment existing datasets to address excessive data sparsity. Our method outperforms competing baselines such as generative adversarial networks, variational autoencoders, and recently proposed diffusion models in synthesizing various datasets to replace or augment the original data by an average improvement of 4.30% in Recall@$n$ and 4.65% in NDCG@$n$.  ( 2 min )
    Attention-based Models for Snow-Water Equivalent Prediction. (arXiv:2311.03388v1 [cs.LG])
    Snow Water-Equivalent (SWE) -- the amount of water available if snowpack is melted -- is a key decision variable used by water management agencies to make irrigation, flood control, power generation and drought management decisions. SWE values vary spatiotemporally -- affected by weather, topography and other environmental factors. While daily SWE can be measured by Snow Telemetry (SNOTEL) stations with requisite instrumentation, such stations are spatially sparse requiring interpolation techniques to create spatiotemporally complete data. While recent efforts have explored machine learning (ML) for SWE prediction, a number of recent ML advances have yet to be considered. The main contribution of this paper is to explore one such ML advance, attention mechanisms, for SWE prediction. Our hypothesis is that attention has a unique ability to capture and exploit correlations that may exist across locations or the temporal spectrum (or both). We present a generic attention-based modeling framework for SWE prediction and adapt it to capture spatial attention and temporal attention. Our experimental results on 323 SNOTEL stations in the Western U.S. demonstrate that our attention-based models outperform other machine learning approaches. We also provide key results highlighting the differences between spatial and temporal attention in this context and a roadmap toward deployment for generating spatially-complete SWE maps.  ( 2 min )
    InterVLS: Interactive Model Understanding and Improvement with Vision-Language Surrogates. (arXiv:2311.03547v1 [cs.AI])
    Deep learning models are widely used in critical applications, highlighting the need for pre-deployment model understanding and improvement. Visual concept-based methods, while increasingly used for this purpose, face challenges: (1) most concepts lack interpretability, (2) existing methods require model knowledge, often unavailable at run time. Additionally, (3) there is no no-code method for post-understanding model improvement. Addressing these, we present InterVLS. The system facilitates model understanding by discovering text-aligned concepts, measuring their influence with model-agnostic linear surrogates. Employing visual analytics, InterVLS offers concept-based explanations and performance insights. It enables users to adjust concept influences to update a model, facilitating no-code model improvement. We evaluate InterVLS in a user study, illustrating its functionality with two scenarios. Results indicate that InterVLS is effective in helping users identify concepts that are influential to a model, gain insights, and adjust concept influence to improve the model. We conclude with a discussion based on our study results.  ( 2 min )
    Exploring Latent Spaces of Tonal Music using Variational Autoencoders. (arXiv:2311.03621v1 [cs.SD])
    Variational Autoencoders (VAEs) have proven to be effective models for producing latent representations of cognitive and semantic value. We assess the degree to which VAEs trained on a prototypical tonal music corpus of 371 Bach's chorales define latent spaces representative of the circle of fifths and the hierarchical relation of each key component pitch as drawn in music cognition. In detail, we compare the latent space of different VAE corpus encodings -- Piano roll, MIDI, ABC, Tonnetz, DFT of pitch, and pitch class distributions -- in providing a pitch space for key relations that align with cognitive distances. We evaluate the model performance of these encodings using objective metrics to capture accuracy, mean square error (MSE), KL-divergence, and computational cost. The ABC encoding performs the best in reconstructing the original data, while the Pitch DFT seems to capture more information from the latent space. Furthermore, an objective evaluation of 12 major or minor transpositions per piece is adopted to quantify the alignment of 1) intra- and inter-segment distances per key and 2) the key distances to cognitive pitch spaces. Our results show that Pitch DFT VAE latent spaces align best with cognitive spaces and provide a common-tone space where overlapping objects within a key are fuzzy clusters, which impose a well-defined order of structural significance or stability -- i.e., a tonal hierarchy. Tonal hierarchies of different keys can be used to measure key distances and the relationships of their in-key components at multiple hierarchies (e.g., notes and chords). The implementation of our VAE and the encodings framework are made available online.  ( 3 min )
    FinA: Fairness of Adverse Effects in Decision-Making of Human-Cyber-Physical-System. (arXiv:2311.03468v1 [cs.AI])
    Ensuring fairness in decision-making systems within Human-Cyber-Physical-Systems (HCPS) is a pressing concern, particularly when diverse individuals, each with varying behaviors and expectations, coexist within the same application space, influenced by a shared set of control actions in the system. The long-term adverse effects of these actions further pose the challenge, as historical experiences and interactions shape individual perceptions of fairness. This paper addresses the challenge of fairness from an equity perspective of adverse effects, taking into account the dynamic nature of human behavior and evolving preferences while recognizing the lasting impact of adverse effects. We formally introduce the concept of Fairness-in-Adverse-Effects (FinA) within the HCPS context. We put forth a comprehensive set of five formulations for FinA, encompassing both the instantaneous and long-term aspects of adverse effects. To empirically validate the effectiveness of our FinA approach, we conducted an evaluation within the domain of smart homes, a pertinent HCPS application. The outcomes of our evaluation demonstrate that the adoption of FinA significantly enhances the overall perception of fairness among individuals, yielding an average improvement of 66.7% when compared to the state-of-the-art method.  ( 2 min )
    Discret2Di -- Deep Learning based Discretization for Model-based Diagnosis. (arXiv:2311.03413v1 [cs.LG])
    Consistency-based diagnosis is an established approach to diagnose technical applications, but suffers from significant modeling efforts, especially for dynamic multi-modal time series. Machine learning seems to be an obvious solution, which becomes less obvious when looking at details: Which notion of consistency can be used? If logical calculi are still to be used, how can dynamic time series be transferred into the discrete world? This paper presents the methodology Discret2Di for automated learning of logical expressions for consistency-based diagnosis. While these logical calculi have advantages by providing a clear notion of consistency, they have the key problem of relying on a discretization of the dynamic system. The solution presented combines machine learning from both the time series and the symbolic domain to automate the learning of logical rules for consistency-based diagnosis.  ( 2 min )
    Low-Rank MDPs with Continuous Action Spaces. (arXiv:2311.03564v1 [cs.LG])
    Low-Rank Markov Decision Processes (MDPs) have recently emerged as a promising framework within the domain of reinforcement learning (RL), as they allow for provably approximately correct (PAC) learning guarantees while also incorporating ML algorithms for representation learning. However, current methods for low-rank MDPs are limited in that they only consider finite action spaces, and give vacuous bounds as $|\mathcal{A}| \to \infty$, which greatly limits their applicability. In this work, we study the problem of extending such methods to settings with continuous actions, and explore multiple concrete approaches for performing this extension. As a case study, we consider the seminal FLAMBE algorithm (Agarwal et al., 2020), which is a reward-agnostic method for PAC RL with low-rank MDPs. We show that, without any modifications to the algorithm, we obtain a similar PAC bound when actions are allowed to be continuous. Specifically, when the model for transition functions satisfies a Hölder smoothness condition w.r.t. actions, and either the policy class has a uniformly bounded minimum density or the reward function is also Hölder smooth, we obtain a polynomial PAC bound that depends on the order of smoothness.  ( 2 min )
    DP-DCAN: Differentially Private Deep Contrastive Autoencoder Network for Single-cell Clustering. (arXiv:2311.03410v1 [cs.LG])
    Single-cell RNA sequencing (scRNA-seq) is important to transcriptomic analysis of gene expression. Recently, deep learning has facilitated the analysis of high-dimensional single-cell data. Unfortunately, deep learning models may leak sensitive information about users. As a result, Differential Privacy (DP) is increasingly used to protect privacy. However, existing DP methods usually perturb whole neural networks to achieve differential privacy, and hence result in great performance overheads. To address this challenge, in this paper, we take advantage of a distinctive property of the autoencoder, namely that it outputs only the dimension-reduced vector in the middle of the network, and design a Differentially Private Deep Contrastive Autoencoder Network (DP-DCAN) by partial network perturbation for single-cell clustering. Since noise is added to only part of the network, the performance improvement is twofold: one part of the network is trained with less noise due to a larger privacy budget, and the other part is trained without any noise. Experimental results on six datasets verify that DP-DCAN is superior to the traditional DP scheme with whole network perturbation. Moreover, DP-DCAN demonstrates strong robustness to adversarial attacks. The code is available at https://github.com/LFD-byte/DP-DCAN.  ( 2 min )
    CAFE: Carbon-Aware Federated Learning in Geographically Distributed Data Centers. (arXiv:2311.03615v1 [cs.LG])
    Training large-scale artificial intelligence (AI) models demands significant computational power and energy, leading to increased carbon footprint with potential environmental repercussions. This paper delves into the challenges of training AI models across geographically distributed (geo-distributed) data centers, emphasizing the balance between learning performance and carbon footprint. We consider Federated Learning (FL) as a solution, which prioritizes model parameter exchange over raw data, ensuring data privacy and compliance with local regulations. Given the variability in carbon intensity across regions, we propose a new framework called CAFE (short for Carbon-Aware Federated Learning) to optimize training within a fixed carbon footprint budget. Our approach incorporates coreset selection to assess learning performance, employs the Lyapunov drift-plus-penalty framework to address the unpredictability of future carbon intensity, and devises an efficient algorithm to address the combinatorial complexity of the data center selection. Through extensive simulations using real-world carbon intensity data, we demonstrate the efficacy of our algorithm, highlighting its superiority over existing methods in optimizing learning performance while minimizing environmental impact.  ( 2 min )
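    The drift-plus-penalty idea mentioned above can be sketched generically: a virtual queue tracks how far cumulative carbon use has run ahead of the per-round budget, and each round picks the option that trades utility against queue-weighted cost. This is a textbook-style illustration under stated assumptions, not the CAFE algorithm; the function names, the `(utility, carbon_cost)` option format, and the weight `v` are hypothetical.

```python
def choose_option(options, queue, v):
    """Pick the index minimizing (-v * utility + queue * carbon_cost).

    options: list of (utility, carbon_cost) pairs available this round.
    queue:   current virtual-queue backlog (accumulated budget overshoot).
    v:       weight on utility relative to budget adherence.
    """
    return min(
        range(len(options)),
        key=lambda i: -v * options[i][0] + queue * options[i][1],
    )


def update_queue(queue, cost, budget_per_round):
    """Virtual queue update: backlog grows when a round spends more than its budget share."""
    return max(queue + cost - budget_per_round, 0.0)


def run(rounds_options, budget_per_round, v):
    """Simulate the online rule over a sequence of rounds; returns (choices, total_cost)."""
    queue, total_cost, choices = 0.0, 0.0, []
    for options in rounds_options:
        i = choose_option(options, queue, v)
        cost = options[i][1]
        queue = update_queue(queue, cost, budget_per_round)
        total_cost += cost
        choices.append(i)
    return choices, total_cost
```

    The rule needs no forecast of future carbon intensity: a high-cost choice raises the backlog, which makes cheaper options more attractive in later rounds.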
    An AI-Guided Data Centric Strategy to Detect and Mitigate Biases in Healthcare Datasets. (arXiv:2311.03425v1 [cs.LG])
    The adoption of diagnosis and prognostic algorithms in healthcare has led to concerns about the perpetuation of bias against disadvantaged groups of individuals. Deep learning methods to detect and mitigate bias have revolved around modifying models, optimization strategies, and threshold calibration with varying levels of success. Here, we generate a data-centric, model-agnostic, task-agnostic approach to evaluate dataset bias by investigating the relationship between how easily different groups are learned at small sample sizes (AEquity). We then apply a systematic analysis of AEq values across subpopulations to identify and mitigate manifestations of racial bias in two known cases in healthcare - Chest X-rays diagnosis with deep convolutional neural networks and healthcare utilization prediction with multivariate logistic regression. AEq is a novel and broadly applicable metric that can be applied to advance equity by diagnosing and remediating bias in healthcare datasets.  ( 2 min )
    ViDa: Visualizing DNA hybridization trajectories with biophysics-informed deep graph embeddings. (arXiv:2311.03411v1 [q-bio.QM])
    Visualization tools can help synthetic biologists and molecular programmers understand the complex reactive pathways of nucleic acid reactions, which can be designed for many potential applications and can be modelled using a continuous-time Markov chain (CTMC). Here we present ViDa, a new visualization approach for DNA reaction trajectories that uses a 2D embedding of the secondary structure state space underlying the CTMC model. To this end, we integrate a scattering transform of the secondary structure adjacency, a variational autoencoder, and a nonlinear dimensionality reduction method. We augment the training loss with domain-specific supervised terms that capture both thermodynamic and kinetic features. We assess ViDa on two well-studied DNA hybridization reactions. Our results demonstrate that the domain-specific features lead to significant quality improvements over the state-of-the-art in DNA state space visualization, successfully separating different folding pathways and thus providing useful insights into dominant reaction mechanisms.  ( 2 min )
    An attempt to generate new bridge types from latent space of variational autoencoder. (arXiv:2311.03380v1 [cs.LG])
    This work attempts to generate new bridge types using generative artificial intelligence technology. Grayscale images of bridge facades with varying component widths were rendered with the 3dsMax animation software, and OpenCV was then used to apply geometric transformations (rotation, horizontal scaling, vertical scaling) to obtain an image dataset of three-span beam bridges, arch bridges, cable-stayed bridges, and suspension bridges. Using the Python programming language with the TensorFlow and Keras deep learning frameworks, a variational autoencoder was constructed and trained, yielding a low-dimensional bridge-type latent space that is convenient for vector operations. The variational autoencoder can combine two existing human-designed bridge types into a new bridge type. Generative artificial intelligence technology can assist bridge designers in bridge-type innovation and can serve as a copilot.  ( 2 min )
  • Open

    Bandit Pareto Set Identification: the Fixed Budget Setting. (arXiv:2311.03992v1 [stat.ML])
    We study a multi-objective pure exploration problem in a multi-armed bandit model. Each arm is associated with an unknown multi-variate distribution and the goal is to identify the distributions whose mean is not uniformly worse than that of another distribution: the Pareto optimal set. We propose and analyze the first algorithms for the \emph{fixed budget} Pareto Set Identification task. We propose Empirical Gap Elimination, a family of algorithms combining a careful estimation of the ``hardness to classify'' each arm in or out of the Pareto set with a generic elimination scheme. We prove that two particular instances, EGE-SR and EGE-SH, have a probability of error that decays exponentially fast with the budget, with an exponent supported by an information theoretic lower-bound. We complement these findings with an empirical study using real-world and synthetic datasets, which showcases the good performance of our algorithms.  ( 2 min )
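    To make the target of the identification task concrete, here is a small sketch that computes the Pareto optimal set when the mean vectors are known exactly. It does not implement Empirical Gap Elimination or any sampling rule; it only illustrates the set the algorithms try to recover from noisy samples. The dominance convention used (at least as good in every objective and strictly better in one) is a common one and is an assumption here, since the paper's handling of ties may differ.

```python
def dominates(a, b):
    """True if mean vector a is at least as good as b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_set(means):
    """Indices of arms whose mean vector is not dominated by any other arm."""
    return [
        i
        for i, m in enumerate(means)
        if not any(dominates(other, m) for j, other in enumerate(means) if j != i)
    ]
```

    For example, with means `[(1, 5), (3, 3), (5, 1), (2, 2), (0, 0)]` the first three arms are mutually incomparable and form the Pareto set, while `(2, 2)` and `(0, 0)` are dominated by `(3, 3)`.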
    Outliers with Opposing Signals Have an Outsized Effect on Neural Network Optimization. (arXiv:2311.04163v1 [cs.LG])
    We identify a new phenomenon in neural network optimization which arises from the interaction of depth and a particular heavy-tailed structure in natural data. Our result offers intuitive explanations for several previously reported observations about network training dynamics. In particular, it implies a conceptually new cause for progressive sharpening and the edge of stability; we also highlight connections to other concepts in optimization and generalization including grokking, simplicity bias, and Sharpness-Aware Minimization. Experimentally, we demonstrate the significant influence of paired groups of outliers in the training data with strong opposing signals: consistent, large magnitude features which dominate the network output throughout training and provide gradients which point in opposite directions. Due to these outliers, early optimization enters a narrow valley which carefully balances the opposing groups; subsequent sharpening causes their loss to rise rapidly, oscillating between high on one group and then the other, until the overall loss spikes. We describe how to identify these groups, explore what sets them apart, and carefully study their effect on the network's optimization and behavior. We complement these experiments with a mechanistic explanation on a toy example of opposing signals and a theoretical analysis of a two-layer linear network on a simple model. Our finding enables new qualitative predictions of training behavior which we confirm experimentally. It also provides a new lens through which to study and improve modern training practices for stochastic optimization, which we highlight via a case study of Adam versus SGD.  ( 3 min )
    AdaSub: Stochastic Optimization Using Second-Order Information in Low-Dimensional Subspaces. (arXiv:2310.20060v2 [math.OC] UPDATED)
    We introduce AdaSub, a stochastic optimization algorithm that computes a search direction based on second-order information in a low-dimensional subspace that is defined adaptively based on available current and past information. Compared to first-order methods, second-order methods exhibit better convergence characteristics, but the need to compute the Hessian matrix at each iteration results in excessive computational expenses, making them impractical. To address this issue, our approach enables the management of computational expenses and algorithm efficiency by enabling the selection of the subspace dimension for the search. Our code is freely available on GitHub, and our preliminary numerical results demonstrate that AdaSub surpasses popular stochastic optimizers in terms of time and number of iterations required to reach a given accuracy.  ( 2 min )
    Flow-based distributionally robust optimization. (arXiv:2310.19253v2 [cs.LG] UPDATED)
    We present a computationally efficient framework, called FlowDRO, for solving flow-based distributionally robust optimization (DRO) problems with Wasserstein uncertainty sets while aiming to find continuous worst-case distribution (also called the Least Favorable Distribution, LFD). The requirement for LFD to be continuous is so that the algorithm can be scalable to problems with larger sample sizes and achieve better generalization capability for the induced robust algorithms. To tackle the computationally challenging infinitely dimensional optimization problem, we leverage flow-based models and continuous-time invertible transport maps between the data distribution and the target distribution. We also develop a Wasserstein proximal gradient flow type of algorithm. In theory, we establish the equivalence of the solution by optimal transport map to the original formulation, as well as the dual form of the problem through Wasserstein calculus and Brenier theorem. In practice, we parameterize the transport maps by a sequence of neural networks progressively trained in blocks by gradient descent. Our computational framework is general, can handle high-dimensional data with large sample sizes, and can be useful for various applications. We demonstrate its usage in adversarial learning, distributionally robust hypothesis testing, and a new mechanism for data-driven distribution perturbation differential privacy, where the proposed method gives strong empirical performance on real high-dimensional data.  ( 2 min )
    Non-Asymptotic Analysis of a UCB-based Top Two Algorithm. (arXiv:2210.05431v3 [stat.ML] UPDATED)
    A Top Two sampling rule for bandit identification is a method which selects the next arm to sample from among two candidate arms, a leader and a challenger. Due to their simplicity and good empirical performance, Top Two methods have received increased attention in recent years. However, for fixed-confidence best arm identification, theoretical guarantees for Top Two methods have only been obtained in the asymptotic regime, when the error level vanishes. In this paper, we derive the first non-asymptotic upper bound on the expected sample complexity of a Top Two algorithm, which holds for any error level. Our analysis highlights sufficient properties for a regret minimization algorithm to be used as leader. These properties are satisfied by the UCB algorithm, and our proposed UCB-based Top Two algorithm simultaneously enjoys non-asymptotic guarantees and competitive empirical performance.  ( 2 min )
    A Corrected Expected Improvement Acquisition Function Under Noisy Observations. (arXiv:2310.05166v2 [cs.LG] UPDATED)
    Sequential maximization of expected improvement (EI) is one of the most widely used policies in Bayesian optimization because of its simplicity and ability to handle noisy observations. In particular, the improvement function often uses the best posterior mean as the best incumbent in noisy settings. However, the uncertainty associated with the incumbent solution is often neglected in many analytic EI-type methods: a closed-form acquisition function is derived in the noise-free setting, but then applied to the setting with noisy observations. To address this limitation, we propose a modification of EI that corrects its closed-form expression by incorporating the covariance information provided by the Gaussian Process (GP) model. This acquisition function specializes to the classical noise-free result, and we argue should replace that formula in Bayesian optimization software packages, tutorials, and textbooks. This enhanced acquisition provides good generality for noisy and noiseless settings. We show that our method achieves a sublinear convergence rate on the cumulative regret bound under heteroscedastic observation noise. Our empirical results demonstrate that our proposed acquisition function can outperform EI in the presence of noisy observations on benchmark functions for black-box optimization, as well as on parameter search for neural network model compression.  ( 2 min )
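    For reference, the classical noise-free closed form that the abstract proposes to correct can be written in a few lines. This sketch shows only the standard expected improvement for maximization given a Gaussian posterior with mean `mu` and standard deviation `sigma` at a candidate point and an incumbent value `best`. The paper's corrected expression, which incorporates GP covariance information, is not reproduced here; the function name and arguments are illustrative.

```python
import math


def expected_improvement(mu, sigma, best):
    """Classical noise-free EI for maximization: E[max(f - best, 0)] with f ~ N(mu, sigma^2)."""
    if sigma <= 0.0:
        # Degenerate posterior: the improvement is deterministic.
        return max(mu - best, 0.0)
    z = (mu - best) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)  # standard normal density
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))         # standard normal CDF
    return (mu - best) * cdf + sigma * pdf
```

    Note that `best` is treated as a known constant in this formula, which is precisely the simplification the abstract argues is problematic when the incumbent is itself an uncertain posterior mean.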
    Learning-Based Optimal Control with Performance Guarantees for Unknown Systems with Latent States. (arXiv:2303.17963v2 [eess.SY] UPDATED)
    As control engineering methods are applied to increasingly complex systems, data-driven approaches for system identification appear as a promising alternative to physics-based modeling. While the Bayesian approaches prevalent for safety-critical applications usually rely on the availability of state measurements, the states of a complex system are often not directly measurable. It may then be necessary to jointly estimate the dynamics and the latent state, making the quantification of uncertainties and the design of controllers with formal performance guarantees considerably more challenging. This paper proposes a novel method for the computation of an optimal input trajectory for unknown nonlinear systems with latent states based on a combination of particle Markov chain Monte Carlo methods and scenario theory. Probabilistic performance guarantees are derived for the resulting input trajectory, and an approach to validate the performance of arbitrary control laws is presented. The effectiveness of the proposed method is demonstrated in a numerical simulation.  ( 2 min )
    Posterior Sampling-Based Bayesian Optimization with Tighter Bayesian Regret Bounds. (arXiv:2311.03760v1 [cs.LG])
    Among various acquisition functions (AFs) in Bayesian optimization (BO), Gaussian process upper confidence bound (GP-UCB) and Thompson sampling (TS) are well-known options with established theoretical properties regarding Bayesian cumulative regret (BCR). Recently, it has been shown that a randomized variant of GP-UCB achieves a tighter BCR bound compared with GP-UCB, which we call the tighter BCR bound for brevity. Inspired by this study, this paper first shows that TS achieves the tighter BCR bound. On the other hand, GP-UCB and TS often practically suffer from manual hyperparameter tuning and over-exploration issues, respectively. To overcome these difficulties, we propose yet another AF called a probability of improvement from the maximum of a sample path (PIMS). We show that PIMS achieves the tighter BCR bound and avoids the hyperparameter tuning, unlike GP-UCB. Furthermore, we demonstrate a wide range of experiments, focusing on the effectiveness of PIMS that mitigates the practical issues of GP-UCB and TS.  ( 2 min )
    Inference via robust optimal transportation: theory and methods. (arXiv:2301.06297v2 [math.ST] UPDATED)
    Optimal transport (OT) theory and the related $p$-Wasserstein distance ($W_p$, $p\geq 1$) are widely applied in statistics and machine learning. In spite of their popularity, inference based on these tools is sensitive to outliers and can perform poorly when the underlying model has heavy tails. To cope with these issues, we introduce a new class of procedures. (i) We consider a robust version of the primal OT problem (ROBOT) and show that it defines the robust Wasserstein distance, $W^{(\lambda)}$, which depends on a tuning parameter $\lambda > 0$. (ii) We illustrate the link between $W_1$ and $W^{(\lambda)}$ and study its key measure-theoretic aspects. (iii) We derive some concentration inequalities for $W^{(\lambda)}$. (iv) We use $W^{(\lambda)}$ to define minimum distance estimators, we provide their statistical guarantees and we illustrate how to apply concentration inequalities for the selection of $\lambda$. (v) We derive the dual form of the ROBOT and illustrate its applicability to machine learning problems (generative adversarial networks and domain adaptation). Numerical exercises provide evidence of the benefits yielded by our methods.  ( 2 min )
    Learning Proposals for Practical Energy-Based Regression. (arXiv:2110.11948v2 [cs.LG] UPDATED)
    Energy-based models (EBMs) have experienced a resurgence within machine learning in recent years, including as a promising alternative for probabilistic regression. However, energy-based regression requires a proposal distribution to be manually designed for training, and an initial estimate has to be provided at test-time. We address both of these issues by introducing a conceptually simple method to automatically learn an effective proposal distribution, which is parameterized by a separate network head. To this end, we derive a surprising result, leading to a unified training objective that jointly minimizes the KL divergence from the proposal to the EBM, and the negative log-likelihood of the EBM. At test-time, we can then employ importance sampling with the trained proposal to efficiently evaluate the learned EBM and produce stand-alone predictions. Furthermore, we utilize our derived training objective to learn mixture density networks (MDNs) with a jointly trained energy-based teacher, consistently outperforming conventional MDN training on four real-world regression tasks within computer vision. Code is available at https://github.com/fregu856/ebms_proposals.  ( 2 min )
    Maximum Mean Discrepancy Meets Neural Networks: The Radon-Kolmogorov-Smirnov Test. (arXiv:2309.02422v3 [stat.ML] UPDATED)
    Maximum mean discrepancy (MMD) refers to a general class of nonparametric two-sample tests that are based on maximizing the mean difference over samples from one distribution $P$ versus another $Q$, over all choices of data transformations $f$ living in some function space $\mathcal{F}$. Inspired by recent work that connects what are known as functions of $\textit{Radon bounded variation}$ (RBV) and neural networks (Parhi and Nowak, 2021, 2023), we study the MMD defined by taking $\mathcal{F}$ to be the unit ball in the RBV space of a given smoothness order $k \geq 0$. This test, which we refer to as the $\textit{Radon-Kolmogorov-Smirnov}$ (RKS) test, can be viewed as a generalization of the well-known and classical Kolmogorov-Smirnov (KS) test to multiple dimensions and higher orders of smoothness. It is also intimately connected to neural networks: we prove that the witness in the RKS test -- the function $f$ achieving the maximum mean difference -- is always a ridge spline of degree $k$, i.e., a single neuron in a neural network. This allows us to leverage the power of modern deep learning toolkits to (approximately) optimize the criterion that underlies the RKS test. We prove that the RKS test has asymptotically full power at distinguishing any distinct pair $P \not= Q$ of distributions, derive its asymptotic null distribution, and carry out extensive experiments to elucidate the strengths and weaknesses of the RKS test versus the more traditional kernel MMD test.  ( 3 min )
    Dynamics of Finite Width Kernel and Prediction Fluctuations in Mean Field Neural Networks. (arXiv:2304.03408v3 [stat.ML] UPDATED)
    We analyze the dynamics of finite width effects in wide but finite feature learning neural networks. Starting from a dynamical mean field theory description of infinite width deep neural network kernel and prediction dynamics, we provide a characterization of the $O(1/\sqrt{\text{width}})$ fluctuations of the DMFT order parameters over random initializations of the network weights. Our results, while perturbative in width, unlike prior analyses, are non-perturbative in the strength of feature learning. In the lazy limit of network training, all kernels are random but static in time and the prediction variance has a universal form. However, in the rich, feature learning regime, the fluctuations of the kernels and predictions are dynamically coupled with a variance that can be computed self-consistently. In two layer networks, we show how feature learning can dynamically reduce the variance of the final tangent kernel and final network predictions. We also show how initialization variance can slow down online learning in wide but finite networks. In deeper networks, kernel variance can dramatically accumulate through subsequent layers at large feature learning strengths, but feature learning continues to improve the signal-to-noise ratio of the feature kernels. In discrete time, we demonstrate that large learning rate phenomena such as edge of stability effects can be well captured by infinite width dynamics and that initialization variance can decrease dynamically. For CNNs trained on CIFAR-10, we empirically find significant corrections to both the bias and variance of network dynamics due to finite width.  ( 3 min )
    Convergence Analysis of Mean Shift. (arXiv:2305.08463v3 [stat.ML] UPDATED)
    The mean shift (MS) algorithm seeks a mode of the kernel density estimate (KDE). This study presents a convergence guarantee for the mode estimate sequence generated by the MS algorithm and an evaluation of the convergence rate, under fairly mild conditions, using an argument based on the Łojasiewicz inequality. Our findings extend existing ones covering analytic kernels and the Epanechnikov kernel. They are significant in that they cover the biweight kernel, which is optimal among non-negative kernels in terms of the asymptotic statistical efficiency for KDE-based mode estimation.  ( 2 min )
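A minimal one-dimensional sketch of the mean shift iteration with the biweight kernel mentioned above. For a kernel with profile k(t) = (1 - t)^2 on t <= 1, the update weights are proportional to -k'(t), that is max(0, 1 - u^2) with u = (x - x_i) / h. The function name, stopping rule, and default values are assumptions made for illustration and are not taken from the paper.

```python
def mean_shift_mode(data, start, bandwidth, max_iter=500, tol=1e-10):
    """Iterate the mean shift update from `start` until it stops moving.

    Each step replaces x by a weighted average of the data, with weights
    max(0, 1 - ((x - x_i) / bandwidth)**2), which is what the biweight
    kernel induces in the mean shift update.
    """
    x = start
    for _ in range(max_iter):
        weights = [max(0.0, 1.0 - ((x - xi) / bandwidth) ** 2) for xi in data]
        total = sum(weights)
        if total == 0.0:
            # No data point lies within one bandwidth of x; the update is undefined.
            break
        new_x = sum(w * xi for w, xi in zip(weights, data)) / total
        if abs(new_x - x) < tol:
            x = new_x
            break
        x = new_x
    return x


# Three points clustered around 1.0 and one far-away point at 5.0.
mode = mean_shift_mode([0.9, 1.0, 1.1, 5.0], start=1.2, bandwidth=1.0)
```

Because the biweight kernel has compact support, the point at 5.0 receives zero weight and does not pull the estimate away from the cluster.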
    Stationary Kernels and Gaussian Processes on Lie Groups and their Homogeneous Spaces I: the compact case. (arXiv:2208.14960v3 [stat.ME] UPDATED)
    Gaussian processes are arguably the most important class of spatiotemporal models within machine learning. They encode prior information about the modeled function and can be used for exact or approximate Bayesian learning. In many applications, particularly in physical sciences and engineering, but also in areas such as geostatistics and neuroscience, invariance to symmetries is one of the most fundamental forms of prior information one can consider. The invariance of a Gaussian process' covariance to such symmetries gives rise to the most natural generalization of the concept of stationarity to such spaces. In this work, we develop constructive and practical techniques for building stationary Gaussian processes on a very large class of non-Euclidean spaces arising in the context of symmetries. Our techniques make it possible to (i) calculate covariance kernels and (ii) sample from prior and posterior Gaussian processes defined on such spaces, both in a practical manner. This work is split into two parts, each involving different technical considerations: part I studies compact spaces, while part II studies non-compact spaces possessing certain structure. Our contributions make the non-Euclidean Gaussian process models we study compatible with well-understood computational techniques available in standard Gaussian process software packages, thereby making them accessible to practitioners.  ( 3 min )
    CeCNN: Copula-enhanced convolutional neural networks in joint prediction of refraction error and axial length based on ultra-widefield fundus images. (arXiv:2311.03967v1 [cs.CV])
    Ultra-widefield (UWF) fundus images are replacing traditional fundus images in screening, detection, prediction, and treatment of complications related to myopia because their much broader visual range is advantageous for highly myopic eyes. Spherical equivalent (SE) is extensively used as the main myopia outcome measure, and axial length (AL) has drawn increasing interest as an important ocular component for assessing myopia. Cutting-edge studies show that SE and AL are strongly correlated. Using the joint information from SE and AL is potentially better than using either separately. In the deep learning community, though there is research on multiple-response tasks with a 3D image biomarker, dependence among responses is only sporadically taken into consideration. Inspired by the spirit that information extracted from the data by statistical methods can improve the prediction accuracy of deep learning models, we formulate a class of multivariate response regression models with a higher-order tensor biomarker, for the bivariate tasks of regression-classification and regression-regression. Specifically, we propose a copula-enhanced convolutional neural network (CeCNN) framework that incorporates the dependence between responses through a Gaussian copula (with parameters estimated from a warm-up CNN) and uses the induced copula-likelihood loss with the backbone CNNs. We establish the statistical framework and algorithms for the aforementioned two bivariate tasks. We show that the CeCNN has better prediction accuracy after adding the dependency information to the backbone models. The modeling and the proposed CeCNN algorithm are applicable beyond the UWF scenario and can be effective with other backbones beyond ResNet and LeNet.  ( 3 min )
    Birth of a Transformer: A Memory Viewpoint. (arXiv:2306.00802v2 [stat.ML] UPDATED)
    Large language models based on transformers have achieved great empirical successes. However, as they are deployed more widely, there is a growing need to better understand their internal mechanisms in order to make them more reliable. These models appear to store vast amounts of knowledge from their training data, and to adapt quickly to new information provided in their context or prompt. We study how transformers balance these two types of knowledge by considering a synthetic setup where tokens are generated from either global or context-specific bigram distributions. By a careful empirical analysis of the training process on a simplified two-layer transformer, we illustrate the fast learning of global bigrams and the slower development of an "induction head" mechanism for the in-context bigrams. We highlight the role of weight matrices as associative memories, provide theoretical insights on how gradients enable their learning during training, and study the role of data-distributional properties.  ( 2 min )
    Accurate 3D Object Detection using Energy-Based Models. (arXiv:2012.04634v2 [cs.CV] UPDATED)
    Accurate 3D object detection (3DOD) is crucial for safe navigation of complex environments by autonomous robots. Regressing accurate 3D bounding boxes in cluttered environments based on sparse LiDAR data is however a highly challenging problem. We address this task by exploring recent advances in conditional energy-based models (EBMs) for probabilistic regression. While methods employing EBMs for regression have demonstrated impressive performance on 2D object detection in images, these techniques are not directly applicable to 3D bounding boxes. In this work, we therefore design a differentiable pooling operator for 3D bounding boxes, serving as the core module of our EBM network. We further integrate this general approach into the state-of-the-art 3D object detector SA-SSD. On the KITTI dataset, our proposed approach consistently outperforms the SA-SSD baseline across all 3DOD metrics, demonstrating the potential of EBM-based regression for highly accurate 3DOD. Code is available at https://github.com/fregu856/ebms_3dod.  ( 2 min )
    Exact Bayesian Inference on Discrete Models via Probability Generating Functions: A Probabilistic Programming Approach. (arXiv:2305.17058v3 [cs.PL] UPDATED)
    We present an exact Bayesian inference method for discrete statistical models, which can find exact solutions to a large class of discrete inference problems, even with infinite support and continuous priors. To express such models, we introduce a probabilistic programming language that supports discrete and continuous sampling, discrete observations, affine functions, (stochastic) branching, and conditioning on discrete events. Our key tool is probability generating functions: they provide a compact closed-form representation of distributions that are definable by programs, thus enabling the exact computation of posterior probabilities, expectation, variance, and higher moments. Our inference method is provably correct and fully automated in a tool called Genfer, which uses automatic differentiation (specifically, Taylor polynomials), but does not require computer algebra. Our experiments show that Genfer is often faster than the existing exact inference tools PSI, Dice, and Prodigy. On a range of real-world inference problems that none of these exact tools can solve, Genfer's performance is competitive with approximate Monte Carlo methods, while avoiding approximation errors.  ( 2 min )
    Computing Approximate $\ell_p$ Sensitivities. (arXiv:2311.04158v1 [cs.LG])
    Recent works in dimensionality reduction for regression tasks have introduced the notion of sensitivity, an estimate of the importance of a specific datapoint in a dataset, offering provable guarantees on the quality of the approximation after removing low-sensitivity datapoints via subsampling. However, fast algorithms for approximating $\ell_p$ sensitivities, which we show is equivalent to approximate $\ell_p$ regression, are known for only the $\ell_2$ setting, in which they are termed leverage scores. In this work, we provide efficient algorithms for approximating $\ell_p$ sensitivities and related summary statistics of a given matrix. In particular, for a given $n \times d$ matrix, we compute an $\alpha$-approximation to its $\ell_1$ sensitivities at the cost of $O(n/\alpha)$ sensitivity computations. For estimating the total $\ell_p$ sensitivity (i.e., the sum of $\ell_p$ sensitivities), we provide an algorithm based on importance sampling of $\ell_p$ Lewis weights, which computes a constant-factor approximation to the total sensitivity at the cost of roughly $O(\sqrt{d})$ sensitivity computations. Furthermore, we estimate the maximum $\ell_1$ sensitivity, up to a $\sqrt{d}$ factor, using $O(d)$ sensitivity computations. We generalize all these results to $\ell_p$ norms for $p > 1$. Lastly, we experimentally show that for a wide class of matrices in real-world datasets, the total sensitivity can be quickly approximated and is significantly smaller than the theoretical prediction, demonstrating that real-world datasets have low intrinsic effective dimensionality.  ( 2 min )
    Discordance Minimization-based Imputation Algorithms for Missing Values in Rating Data. (arXiv:2311.04035v1 [stat.ML])
    Ratings are frequently used to evaluate and compare subjects in various applications, from education to healthcare, because ratings provide succinct yet credible measures for comparing subjects. However, when multiple rating lists are combined or considered together, subjects often have missing ratings, because most rating lists do not rate every subject in the combined list. In this study, we propose analyses on missing value patterns using six real-world data sets in various applications, as well as the conditions for applicability of imputation algorithms. Based on the special structures and properties derived from the analyses, we propose optimization models and algorithms that minimize the total rating discordance across rating providers to impute missing ratings in the combined rating lists, using only the known rating information. The total rating discordance is defined as the sum of the pairwise discordance metric, which can be written as a quadratic function. Computational experiments based on real-world and synthetic rating data sets show that the proposed methods outperform the state-of-the-art general imputation methods in the literature in terms of imputation accuracy.  ( 2 min )
    Interaction Measures, Partition Lattices and Kernel Tests for High-Order Interactions. (arXiv:2306.00904v3 [stat.ML] UPDATED)
    Models that rely solely on pairwise relationships often fail to capture the complete statistical structure of the complex multivariate data found in diverse domains, such as socio-economic, ecological, or biomedical systems. Non-trivial dependencies between groups of more than two variables can play a significant role in the analysis and modelling of such systems, yet extracting such high-order interactions from data remains challenging. Here, we introduce a hierarchy of $d$-order ($d \geq 2$) interaction measures, increasingly inclusive of possible factorisations of the joint probability distribution, and define non-parametric, kernel-based tests to establish systematically the statistical significance of $d$-order interactions. We also establish mathematical links with lattice theory, which elucidate the derivation of the interaction measures and their composite permutation tests; clarify the connection of simplicial complexes with kernel matrix centring; and provide a means to enhance computational efficiency. We illustrate our results numerically with validations on synthetic data, and through an application to neuroimaging data.  ( 2 min )
    Additive Covariance Matrix Models: Modelling Regional Electricity Net-Demand in Great Britain. (arXiv:2211.07451v2 [stat.AP] UPDATED)
    Forecasts of regional electricity net-demand, consumption minus embedded generation, are an essential input for reliable and economic power system operation, and energy trading. While such forecasts are typically performed region by region, operations such as managing power flows require spatially coherent joint forecasts, which account for cross-regional dependencies. Here, we forecast the joint distribution of net-demand across the 14 regions constituting Great Britain's electricity network. Joint modelling is complicated by the fact that the net-demand variability within each region, and the dependencies between regions, vary with temporal, socio-economic and weather-related factors. We accommodate these characteristics by proposing a multivariate Gaussian model based on a modified Cholesky parametrisation, which allows us to model each unconstrained parameter via an additive model. Given that the number of model parameters and covariates is large, we adopt a semi-automated approach to model selection, based on gradient boosting. In addition to comparing the forecasting performance of several versions of the proposed model with that of two non-Gaussian copula-based models, we visually explore the model output to interpret how the covariates affect net-demand variability and dependencies. The code for reproducing the results in this paper is available at https://doi.org/10.5281/zenodo.7315105, while methods for building and fitting multivariate Gaussian additive models are provided by the SCM R package, available at https://github.com/VinGioia90/SCM.  ( 2 min )
    On the Impact of Overparameterization on the Training of a Shallow Neural Network in High Dimensions. (arXiv:2311.03794v1 [math.OC])
    We study the training dynamics of a shallow neural network with quadratic activation functions and quadratic cost in a teacher-student setup. In line with previous works on the same neural architecture, the optimization is performed following the gradient flow on the population risk, where the average over data points is replaced by the expectation over their distribution, assumed to be Gaussian. We first derive convergence properties for the gradient flow and quantify the overparameterization that is necessary to achieve a strong signal recovery. Then, assuming that the teachers and the students at initialization form independent orthonormal families, we derive a high-dimensional limit for the flow and show that the minimal overparameterization is sufficient for strong recovery. We verify by numerical experiments that these results hold for more general initializations.  ( 2 min )
    Hypergraphs with node attributes: structure and inference. (arXiv:2311.03857v1 [cs.SI])
    Many networked datasets with units interacting in groups of two or more, encoded with hypergraphs, are accompanied by extra information about nodes, such as the role of an individual in a workplace. Here we show how these node attributes can be used to improve our understanding of the structure resulting from higher-order interactions. We consider the problem of community detection in hypergraphs and develop a principled model that combines higher-order interactions and node attributes to better represent the observed interactions and to detect communities more accurately than using either of these types of information alone. The method learns automatically from the input data the extent to which structure and attributes contribute to explaining the data, down-weighting or discarding attributes if not informative. Our algorithmic implementation is efficient and scales to large hypergraphs and interactions of large numbers of units. We apply our method to a variety of systems, showing strong performance in hyperedge prediction tasks and in selecting community divisions that correlate with attributes when these are informative, but discarding them otherwise. Our approach illustrates the advantage of using informative node attributes when available with higher-order data.  ( 2 min )
    How to Scale Your EMA. (arXiv:2307.13813v3 [stat.ML] UPDATED)
    Preserving training dynamics across batch sizes is an important tool for practical machine learning as it enables the trade-off between batch size and wall-clock time. This trade-off is typically enabled by a scaling rule; for example, in stochastic gradient descent, one should scale the learning rate linearly with the batch size. Another important machine learning tool is the model EMA, a functional copy of a target model, whose parameters move towards those of its target model according to an Exponential Moving Average (EMA) at a rate parameterized by a momentum hyperparameter. This model EMA can improve the robustness and generalization of supervised learning, stabilize pseudo-labeling, and provide a learning signal for Self-Supervised Learning (SSL). Prior works have not considered the optimization of the model EMA when performing scaling, leading to different training dynamics across batch sizes and lower model performance. In this work, we provide a scaling rule for optimization in the presence of a model EMA and demonstrate the rule's validity across a range of architectures, optimizers, and data modalities. We also show the rule's validity where the model EMA contributes to the optimization of the target model, enabling us to train EMA-based pseudo-labeling and SSL methods at small and large batch sizes. For SSL, we enable training of BYOL up to batch size 24,576 without sacrificing performance, a 6$\times$ wall-clock time reduction under idealized hardware settings.  ( 3 min )
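A small sketch of the two ingredients described above: the model EMA update and the linear learning-rate rule for SGD. The abstract does not state the EMA rule itself. The exponentiated momentum used here (momentum raised to the batch-size ratio kappa) is an assumption, motivated by the fact that kappa consecutive EMA steps against a fixed target compound to momentum**kappa. All names are hypothetical.

```python
def ema_update(ema_params, model_params, momentum):
    """One EMA step: move each EMA parameter towards the target model's value."""
    return [momentum * e + (1.0 - momentum) * p
            for e, p in zip(ema_params, model_params)]


def scale_hyperparameters(base_lr, base_momentum, kappa):
    """Rescale hyperparameters when the batch size is multiplied by `kappa`.

    - Learning rate: linear scaling, as stated for SGD in the abstract.
    - EMA momentum: raised to the power kappa. ASSUMPTION, not taken from
      the abstract; chosen so one large-batch EMA step matches kappa
      small-batch steps when the target parameters are held fixed.
    """
    return base_lr * kappa, base_momentum ** kappa


# Two small-batch EMA steps versus one step with the rescaled momentum.
target = [1.0]
two_steps = ema_update(ema_update([0.0], target, 0.9), target, 0.9)
_, scaled_momentum = scale_hyperparameters(0.1, 0.9, 2)
one_step = ema_update([0.0], target, scaled_momentum)
```

With a fixed target the two trajectories coincide, which is the intuition behind exponentiating the momentum. In real training the target moves between steps, so the match is only approximate.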
    Doubly Robust Kernel Statistics for Testing Distributional Treatment Effects. (arXiv:2212.04922v2 [stat.ML] UPDATED)
    With the widespread application of causal inference, it is increasingly important to have tools which can test for the presence of causal effects in a diverse array of circumstances. In this vein we focus on the problem of testing for distributional causal effects, where the treatment affects not just the mean, but also higher order moments of the distribution, as well as multidimensional or structured outcomes. We build upon a previously introduced framework, Counterfactual Mean Embeddings, for representing causal distributions within Reproducing Kernel Hilbert Spaces (RKHS) by proposing new, improved estimators for the distributional embeddings. These improved estimators are inspired by doubly robust estimators of the causal mean, using a similar form within the kernel space. We analyse these estimators, proving they retain the doubly robust property and have improved convergence rates compared to the original estimators. This leads to new permutation-based tests for distributional causal effects, using the estimators we propose as test statistics. We experimentally and theoretically demonstrate the validity of our tests.  ( 2 min )
    Improved particle-flow event reconstruction with scalable neural networks for current and future particle detectors. (arXiv:2309.06782v3 [physics.data-an] UPDATED)
    We study scalable machine learning models for full event reconstruction in high-energy electron-positron collisions based on a highly granular detector simulation. Particle-flow reconstruction can be formulated as a supervised learning task using tracks and calorimeter clusters or hits. We compare a graph neural network and a kernel-based transformer and demonstrate that both avoid quadratic memory allocation and computational cost while achieving realistic reconstruction. We show that hyperparameter tuning on a supercomputer significantly enhances the physics performance of the models, improving the jet transverse momentum resolution by up to 50% compared to the baseline. The resulting model is highly portable across hardware processors. Finally, we demonstrate that the model can be trained on highly granular inputs consisting of tracks and calorimeter hits, resulting in physics performance competitive with the baseline. Datasets and software to reproduce the studies are published following the findable, accessible, interoperable, and reusable principles.  ( 2 min )
    Comparing Causal Frameworks: Potential Outcomes, Structural Models, Graphs, and Abstractions. (arXiv:2306.14351v2 [stat.ME] UPDATED)
    The aim of this paper is to make clear and precise the relationship between the Rubin causal model (RCM) and structural causal model (SCM) frameworks for causal inference. Adopting a neutral logical perspective, and drawing on previous work, we show what is required for an RCM to be representable by an SCM. A key result then shows that every RCM -- including those that violate algebraic principles implied by the SCM framework -- emerges as an abstraction of some representable RCM. Finally, we illustrate the power of this conciliatory perspective by pinpointing an important role for SCM principles in classic applications of RCMs; conversely, we offer a characterization of the algebraic constraints implied by a graph, helping to substantiate further comparisons between the two frameworks.  ( 2 min )
    Sparse Interaction Additive Networks via Feature Interaction Detection and Sparse Selection. (arXiv:2209.09326v2 [cs.LG] UPDATED)
    There is currently a large gap in performance between the statistically rigorous methods like linear regression or additive splines and the powerful deep methods using neural networks. Previous works attempting to close this gap have failed to fully investigate the exponentially growing number of feature combinations which deep networks consider automatically during training. In this work, we develop a tractable selection algorithm to efficiently identify the necessary feature combinations by leveraging techniques in feature interaction detection. Our proposed Sparse Interaction Additive Networks (SIAN) construct a bridge from these simple and interpretable models to fully connected neural networks. SIAN achieves competitive performance against state-of-the-art methods across multiple large-scale tabular datasets and consistently finds an optimal tradeoff between the modeling capacity of neural networks and the generalizability of simpler methods.  ( 2 min )
    Hilbert's projective metric for functions of bounded growth and exponential convergence of Sinkhorn's algorithm. (arXiv:2311.04041v1 [math.PR])
    We study versions of Hilbert's projective metric for spaces of integrable functions of bounded growth. These metrics originate from cones which are relaxations of the cone of all non-negative functions, in the sense that they include all functions having non-negative integral values when multiplied with certain test functions. We show that kernel integral operators are contractions with respect to suitable specifications of such metrics even for kernels which are not bounded away from zero, provided that the decay to zero of the kernel is controlled. As an application to entropic optimal transport, we show exponential convergence of Sinkhorn's algorithm in settings where the marginal distributions have sufficiently light tails compared to the growth of the cost function.  ( 2 min )
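For readers unfamiliar with the algorithm whose convergence is analysed above, here is a minimal discrete sketch of Sinkhorn's iterations for entropic optimal transport. It works on finite marginals with a bounded cost, which is the classical setting rather than the unbounded one the paper addresses. The function name, the regularization parameter `epsilon`, and the fixed iteration count are illustrative assumptions.

```python
import math


def sinkhorn(a, b, cost, epsilon, n_iter=200):
    """Entropic OT between discrete marginals a (length n) and b (length m).

    Alternately rescales the rows and columns of the Gibbs kernel
    K[i][j] = exp(-cost[i][j] / epsilon) so that the coupling
    u[i] * K[i][j] * v[j] matches the prescribed marginals.
    """
    n, m = len(a), len(b)
    K = [[math.exp(-cost[i][j] / epsilon) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iter):
        # Row scaling: enforce the first marginal.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        # Column scaling: enforce the second marginal.
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]


plan = sinkhorn([0.5, 0.5], [0.25, 0.75], [[0.0, 1.0], [1.0, 0.0]], epsilon=0.5)
```

In this bounded example every kernel entry is strictly positive, the case in which contraction in Hilbert's projective metric is classical. The abstract concerns kernels that are not bounded away from zero.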
    LISBET: a self-supervised Transformer model for the automatic segmentation of social behavior motifs. (arXiv:2311.04069v1 [cs.CV])
    Social behavior, defined as the process by which individuals act and react in response to others, is crucial for the function of societies and holds profound implications for mental health. To fully grasp the intricacies of social behavior and identify potential therapeutic targets for addressing social deficits, it is essential to understand its core principles. Although machine learning algorithms have made it easier to study specific aspects of complex behavior, current methodologies tend to focus primarily on single-animal behavior. In this study, we introduce LISBET (seLf-supervIsed Social BEhavioral Transformer), a model designed to detect and segment social interactions. Our model eliminates the need for feature selection and extensive human annotation by using self-supervised learning to detect and quantify social behaviors from dynamic body parts tracking data. LISBET can be used in hypothesis-driven mode to automate behavior classification using supervised finetuning, and in discovery-driven mode to segment social behavior motifs using unsupervised learning. We found that motifs recognized using the discovery-driven approach not only closely match the human annotations but also correlate with the electrophysiological activity of dopaminergic neurons in the Ventral Tegmental Area (VTA). We hope LISBET will help the community improve our understanding of social behaviors and their neural underpinnings.  ( 2 min )
    Counterfactual Data Augmentation with Contrastive Learning. (arXiv:2311.03630v1 [cs.LG])
    Statistical disparity between distinct treatment groups is one of the most significant challenges for estimating Conditional Average Treatment Effects (CATE). To address this, we introduce a model-agnostic data augmentation method that imputes the counterfactual outcomes for a selected subset of individuals. Specifically, we utilize contrastive learning to learn a representation space and a similarity measure such that in the learned representation space close individuals identified by the learned similarity measure have similar potential outcomes. This property ensures reliable imputation of counterfactual outcomes for the individuals with close neighbors from the alternative treatment group. By augmenting the original dataset with these reliable imputations, we can effectively reduce the discrepancy between different treatment groups, while inducing minimal imputation error. The augmented dataset is subsequently employed to train CATE estimation models. Theoretical analysis and experimental studies on synthetic and semi-synthetic benchmarks demonstrate that our method achieves significant improvements in both performance and robustness to overfitting across state-of-the-art models.  ( 2 min )
    Loss Dynamics of Temporal Difference Reinforcement Learning. (arXiv:2307.04841v2 [stat.ML] UPDATED)
    Reinforcement learning has been successful across several applications in which agents have to learn to act in environments with sparse feedback. However, despite this empirical success, there is still a lack of theoretical understanding of how the parameters of reinforcement learning models and the features used to represent states interact to control the dynamics of learning. In this work, we use concepts from statistical physics to study the typical-case learning curves for temporal difference learning of a value function with linear function approximators. Our theory is derived under a Gaussian equivalence hypothesis where averages over the random trajectories are replaced with temporally correlated Gaussian feature averages, and we validate our assumptions on small-scale Markov Decision Processes. We find that the stochastic semi-gradient noise due to subsampling the space of possible episodes leads to significant plateaus in the value error, unlike in traditional gradient descent dynamics. We study how learning dynamics and plateaus depend on feature structure, learning rate, discount factor, and reward function. We then analyze how strategies like learning rate annealing and reward shaping can favorably alter learning dynamics and plateaus. To conclude, our work introduces new tools to open a new direction towards developing a theory of learning dynamics in reinforcement learning.  ( 2 min )
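    For orientation, the semi-gradient TD(0) update with a linear value function, which is the learning rule whose dynamics the abstract studies, can be sketched as follows. This is a generic textbook version on a made-up three-state chain, not the paper's experimental setup; `td0_linear`, the one-hot features, and the deterministic episode are all assumptions chosen for illustration.

```python
def td0_linear(episode, features, gamma=0.9, lr=0.1, sweeps=2000):
    """Semi-gradient TD(0) with a linear value function V(s) = w . phi(s)."""
    dim = len(next(iter(features.values())))
    w = [0.0] * dim

    def value(state):
        # Terminal states are encoded as None and have value 0.
        if state is None:
            return 0.0
        return sum(wi * xi for wi, xi in zip(w, features[state]))

    for _ in range(sweeps):
        for state, reward, next_state in episode:
            # TD error: bootstrapped target minus current estimate.
            delta = reward + gamma * value(next_state) - value(state)
            # Semi-gradient step: the target is treated as a constant.
            for i, xi in enumerate(features[state]):
                w[i] += lr * delta * xi
    return w

# Hypothetical chain 0 -> 1 -> 2 -> terminal, reward 1 on the last step.
features = {0: [1.0, 0.0, 0.0], 1: [0.0, 1.0, 0.0], 2: [0.0, 0.0, 1.0]}
episode = [(0, 0.0, 1), (1, 0.0, 2), (2, 1.0, None)]
weights = td0_linear(episode, features)
```

    With one-hot features the weights are the state values themselves, so this toy run should approach the discounted returns 0.81, 0.9, and 1.0; it replays one fixed episode, so it does not exhibit the sampling-noise plateaus described above.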
    The Linear Representation Hypothesis and the Geometry of Large Language Models. (arXiv:2311.03658v1 [cs.CL])
    Informally, the 'linear representation hypothesis' is the idea that high-level concepts are represented linearly as directions in some representation space. In this paper, we address two closely related questions: What does "linear representation" actually mean? And, how do we make sense of geometric notions (e.g., cosine similarity or projection) in the representation space? To answer these, we use the language of counterfactuals to give two formalizations of "linear representation", one in the output (word) representation space, and one in the input (sentence) space. We then prove these connect to linear probing and model steering, respectively. To make sense of geometric notions, we use the formalization to identify a particular (non-Euclidean) inner product that respects language structure in a sense we make precise. Using this causal inner product, we show how to unify all notions of linear representation. In particular, this allows the construction of probes and steering vectors using counterfactual pairs. Experiments with LLaMA-2 demonstrate the existence of linear representations of concepts, the connection to interpretation and control, and the fundamental role of the choice of inner product.  ( 2 min )
    Offline Policy Evaluation and Optimization under Confounding. (arXiv:2211.16583v4 [stat.ML] UPDATED)
    Evaluating and optimizing policies in the presence of unobserved confounders is a problem of growing interest in offline reinforcement learning. Using conventional methods for offline RL in the presence of confounding can not only lead to poor decisions and poor policies, but also have disastrous effects in critical applications such as healthcare and education. We map out the landscape of offline policy evaluation for confounded MDPs, distinguishing assumptions on confounding based on whether they are memoryless and on their effect on the data-collection policies. We characterize settings where consistent value estimates are provably not achievable, and provide algorithms with guarantees to instead estimate lower bounds on the value. When consistent estimates are achievable, we provide algorithms for value estimation with sample complexity guarantees. We also present new algorithms for offline policy improvement and prove local convergence guarantees. Finally, we experimentally evaluate our algorithms on both a gridworld environment and a simulated healthcare setting of managing sepsis patients. In gridworld, our model-based method provides tighter lower bounds than existing methods, while in the sepsis simulator, our methods significantly outperform confounder-oblivious benchmarks.  ( 2 min )
    Block majorization-minimization with diminishing radius for constrained nonconvex optimization. (arXiv:2012.03503v5 [math.OC] UPDATED)
    Block majorization-minimization (BMM) is a simple iterative algorithm for nonconvex constrained optimization that sequentially minimizes majorizing surrogates of the objective function in each block coordinate while the other coordinates are held fixed. BMM entails a large class of optimization algorithms such as block coordinate descent and its proximal-point variant, expectation-minimization, and block projected gradient descent. We establish that for general constrained nonconvex optimization, BMM with strongly convex surrogates can produce an $\epsilon$-stationary point within $O(\epsilon^{-2}(\log \epsilon^{-1})^{2})$ iterations and asymptotically converges to the set of stationary points. Furthermore, we propose a trust-region variant of BMM that can handle surrogates that are only convex and still obtain the same iteration complexity and asymptotic stationarity. These results hold robustly even when the convex sub-problems are inexactly solved as long as the optimality gaps are summable. As an application, we show that a regularized version of the celebrated multiplicative update algorithm for nonnegative matrix factorization by Lee and Seung has iteration complexity of $O(\epsilon^{-2}(\log \epsilon^{-1})^{2})$. The same result holds for a wide class of regularized nonnegative tensor decomposition algorithms as well as the classical block projected gradient descent algorithm. These theoretical results are validated through various numerical experiments.  ( 3 min )
    Inexact bilevel stochastic gradient methods for constrained and unconstrained lower-level problems. (arXiv:2110.00604v3 [math.OC] UPDATED)
    Two-level stochastic optimization formulations have become instrumental in a number of machine learning contexts such as continual learning, neural architecture search, adversarial learning, and hyperparameter tuning. Practical stochastic bilevel optimization problems become challenging in optimization or learning scenarios where the number of variables is high or there are constraints. In this paper, we introduce a bilevel stochastic gradient method for bilevel problems with nonlinear and possibly nonconvex lower-level constraints. We also present a comprehensive convergence theory that addresses both the lower-level unconstrained and constrained cases and covers all inexact calculations of the adjoint gradient (also called hypergradient), such as the inexact solution of the lower-level problem, inexact computation of the adjoint formula (due to the inexact solution of the adjoint equation or use of a truncated Neumann series), and noisy estimates of the gradients, Hessians, and Jacobians involved. To promote the use of bilevel optimization in large-scale learning, we have developed new low-rank practical bilevel stochastic gradient methods (BSG-N-FD and BSG-1) that do not require second-order derivatives and, in the lower-level unconstrained case, dismiss any matrix-vector products.  ( 2 min )
    Optimizing Solution-Samplers for Combinatorial Problems: The Landscape of Policy-Gradient Methods. (arXiv:2310.05309v2 [cs.LG] UPDATED)
    Deep Neural Networks and Reinforcement Learning methods have empirically shown great promise in tackling challenging combinatorial problems. In those methods a deep neural network is used as a solution generator which is then trained by gradient-based methods (e.g., policy gradient) to successively obtain better solution distributions. In this work we introduce a novel theoretical framework for analyzing the effectiveness of such methods. We ask whether there exist generative models that (i) are expressive enough to generate approximately optimal solutions; (ii) have a tractable, i.e., polynomial in the size of the input, number of parameters; (iii) have an optimization landscape that is benign in the sense that it does not contain sub-optimal stationary points. Our main contribution is a positive answer to this question. Our result holds for a broad class of combinatorial problems including Max- and Min-Cut, Max-$k$-CSP, Maximum-Weight-Bipartite-Matching, and the Traveling Salesman Problem. As a byproduct of our analysis we introduce a novel regularization process over vanilla gradient descent and provide theoretical and experimental evidence that it helps address vanishing-gradient issues and escape bad stationary points.  ( 2 min )
    Kernel-, mean- and noise-marginalised Gaussian processes for exoplanet transits and $H_0$ inference. (arXiv:2311.04153v1 [astro-ph.CO])
    Using a fully Bayesian approach, Gaussian Process regression is extended to include marginalisation over the kernel choice and kernel hyperparameters. In addition, Bayesian model comparison via the evidence enables direct kernel comparison. The calculation of the joint posterior was implemented with a transdimensional sampler which simultaneously samples over the discrete kernel choice and their hyperparameters by embedding these in a higher-dimensional space, from which samples are taken using nested sampling. This method was explored on synthetic data from exoplanet transit light curve simulations. The true kernel was recovered in the low noise region while no kernel was preferred for larger noise. Furthermore, inference of the physical exoplanet hyperparameters was conducted. In the high noise region, either the bias in the posteriors was removed, the posteriors were broadened or the accuracy of the inference was increased. In addition, the uncertainty in mean function predictive distribution increased due to the uncertainty in the kernel choice. Subsequently, the method was extended to marginalisation over mean functions and noise models and applied to the inference of the present-day Hubble parameter, $H_0$, from real measurements of the Hubble parameter as a function of redshift, derived from the cosmologically model-independent cosmic chronometer and $\Lambda$CDM-dependent baryon acoustic oscillation observations. The inferred $H_0$ values from the cosmic chronometers, baryon acoustic oscillations and combined datasets are $H_0$ = 66$\pm$6 km/s/Mpc, $H_0$ = 67$\pm$10 km/s/Mpc and $H_0$ = 69$\pm$6 km/s/Mpc, respectively. The kernel posterior of the cosmic chronometers dataset prefers a non-stationary linear kernel. Finally, the datasets are shown to be not in tension with ln(R)=12.17$\pm$0.02.  ( 3 min )
    Convergence of Adam Under Relaxed Assumptions. (arXiv:2304.13972v3 [math.OC] UPDATED)
    In this paper, we provide a rigorous proof of convergence of the Adaptive Moment Estimate (Adam) algorithm for a wide class of optimization objectives. Despite the popularity and efficiency of the Adam algorithm in training deep neural networks, its theoretical properties are not yet fully understood, and existing convergence proofs require unrealistically strong assumptions, such as globally bounded gradients, to show the convergence to stationary points. In this paper, we show that Adam provably converges to $\epsilon$-stationary points with ${O}(\epsilon^{-4})$ gradient complexity under far more realistic conditions. The key to our analysis is a new proof of boundedness of gradients along the optimization trajectory of Adam, under a generalized smoothness assumption according to which the local smoothness (i.e., Hessian norm when it exists) is bounded by a sub-quadratic function of the gradient norm. Moreover, we propose a variance-reduced version of Adam with an accelerated gradient complexity of ${O}(\epsilon^{-3})$.  ( 2 min )
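    For readers who want the update rule in front of them, here is a minimal sketch of the standard Adam iteration (exponential moving averages of the gradient and its square, with bias correction). It is the textbook algorithm run on a toy quadratic, not the variance-reduced variant proposed in the abstract; `adam_minimize`, the objective, and the hyperparameter values are illustrative assumptions.

```python
import math

def adam_minimize(grad, x0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    """Standard Adam on a deterministic gradient oracle (illustrative sketch)."""
    x = list(x0)
    m = [0.0] * len(x)  # first-moment (mean) estimate
    v = [0.0] * len(x)  # second-moment (uncentered variance) estimate
    for t in range(1, steps + 1):
        g = grad(x)
        for i in range(len(x)):
            m[i] = beta1 * m[i] + (1 - beta1) * g[i]
            v[i] = beta2 * v[i] + (1 - beta2) * g[i] * g[i]
            # Bias correction compensates for the zero initialization.
            m_hat = m[i] / (1 - beta1 ** t)
            v_hat = v[i] / (1 - beta2 ** t)
            x[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x

# Toy objective (hypothetical): f(x) = (x0 - 3)^2 + (x1 + 1)^2.
def grad(x):
    return [2 * (x[0] - 3), 2 * (x[1] + 1)]

x_final = adam_minimize(grad, [0.0, 0.0])
```

    On this smooth, strongly convex toy problem the gradients stay bounded trivially; the abstract's contribution concerns proving such boundedness along the trajectory under a much weaker generalized smoothness assumption.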
    Manifold learning: what, how, and why. (arXiv:2311.03757v1 [stat.ML])
    Manifold learning (ML), known also as non-linear dimension reduction, is a set of methods to find the low dimensional structure of data. Dimension reduction for large, high dimensional data is not merely a way to reduce the data; the new representations and descriptors obtained by ML reveal the geometric shape of high dimensional point clouds, and allow one to visualize, de-noise and interpret them. This survey presents the principles underlying ML, the representative methods, as well as their statistical foundations from a practicing statistician's perspective. It describes the trade-offs, and what theory tells us about the parameter and algorithmic choices we make in order to obtain reliable conclusions.  ( 2 min )
    Filtered Partial Differential Equations: a robust surrogate constraint in physics-informed deep learning framework. (arXiv:2311.03776v1 [physics.flu-dyn])
    Embedding physical knowledge into neural network (NN) training has been a hot topic. However, when facing the complex real world, most of the existing methods still strongly rely on the quantity and quality of observation data. Furthermore, the neural networks often struggle to converge when the solution to the real equation is very complex. Inspired by large eddy simulation in computational fluid dynamics, we propose an improved method based on filtering. We analyzed the causes of the difficulties in physics-informed machine learning, and proposed a surrogate constraint (filtered PDE, FPDE in short) of the original physical equations to reduce the influence of noisy and sparse observation data. In the noise and sparsity experiment, the proposed FPDE models (which are optimized by FPDE constraints) have better robustness than the conventional PDE models. Experiments demonstrate that the FPDE model can obtain a solution of the same quality with 100% higher noise and only 12% of the observation data used by the baseline. Besides, two groups of real measurement data are used to show the FPDE improvements in real cases. The final results show that FPDE still gives more physically reasonable solutions when facing the incomplete equation problem and the extremely sparse and high-noise conditions. For combining real-world experiment data into physics-informed training, the proposed FPDE constraint is useful and performs well in two real-world experiments: modeling the blood velocity in vessels and cell migration in scratches.  ( 2 min )
    Joint model for longitudinal and spatio-temporal survival data. (arXiv:2311.04008v1 [q-fin.RM])
    In credit risk analysis, survival models with fixed and time-varying covariates are widely used to predict a borrower's time-to-event. When the time-varying drivers are endogenous, modelling jointly the evolution of the survival time and the endogenous covariates is the most appropriate approach, also known as the joint model for longitudinal and survival data. In addition to the temporal component, credit risk models can be enhanced when including borrowers' geographical information by considering spatial clustering and its variation over time. We propose the Spatio-Temporal Joint Model (STJM) to capture spatial and temporal effects and their interaction. This Bayesian hierarchical joint model reckons the survival effect of unobserved heterogeneity among borrowers located in the same region at a particular time. To estimate the STJM model for large datasets, we consider the Integrated Nested Laplace Approximation (INLA) methodology. We apply the STJM to predict the time to full prepayment on a large dataset of 57,258 US mortgage borrowers with more than 2.5 million observations. Empirical results indicate that including spatial effects consistently improves the performance of the joint model. However, the gains are less definitive when we additionally include spatio-temporal interactions.  ( 2 min )
    Low-Rank MDPs with Continuous Action Spaces. (arXiv:2311.03564v1 [cs.LG])
    Low-Rank Markov Decision Processes (MDPs) have recently emerged as a promising framework within the domain of reinforcement learning (RL), as they allow for provably approximately correct (PAC) learning guarantees while also incorporating ML algorithms for representation learning. However, current methods for low-rank MDPs are limited in that they only consider finite action spaces, and give vacuous bounds as $|\mathcal{A}| \to \infty$, which greatly limits their applicability. In this work, we study the problem of extending such methods to settings with continuous actions, and explore multiple concrete approaches for performing this extension. As a case study, we consider the seminal FLAMBE algorithm (Agarwal et al., 2020), which is a reward-agnostic method for PAC RL with low-rank MDPs. We show that, without any modifications to the algorithm, we obtain a similar PAC bound when actions are allowed to be continuous. Specifically, when the model for transition functions satisfies a Hölder smoothness condition w.r.t. actions, and either the policy class has a uniformly bounded minimum density or the reward function is also Hölder smooth, we obtain a polynomial PAC bound that depends on the order of smoothness.  ( 2 min )
    Blocked Collaborative Bandits: Online Collaborative Filtering with Per-Item Budget Constraints. (arXiv:2311.03376v1 [cs.IR])
    We consider the problem of \emph{blocked} collaborative bandits where there are multiple users, each with an associated multi-armed bandit problem. These users are grouped into \emph{latent} clusters such that the mean reward vectors of users within the same cluster are identical. Our goal is to design algorithms that maximize the cumulative reward accrued by all the users over time, under the \emph{constraint} that no arm of a user is pulled more than $\mathsf{B}$ times. This problem has been originally considered by \cite{Bresler:2014}, and designing regret-optimal algorithms for it has since remained an open problem. In this work, we propose an algorithm called \texttt{B-LATTICE} (Blocked Latent bAndiTs via maTrIx ComplEtion) that collaborates across users, while simultaneously satisfying the budget constraints, to maximize their cumulative rewards. Theoretically, under certain reasonable assumptions on the latent structure, with $\mathsf{M}$ users, $\mathsf{N}$ arms, $\mathsf{T}$ rounds per user, and $\mathsf{C}=O(1)$ latent clusters, \texttt{B-LATTICE} achieves a per-user regret of $\widetilde{O}(\sqrt{\mathsf{T}(1 + \mathsf{N}\mathsf{M}^{-1})})$ under a budget constraint of $\mathsf{B}=\Theta(\log \mathsf{T})$. These are the first sub-linear regret bounds for this problem, and match the minimax regret bounds when $\mathsf{B}=\mathsf{T}$. Empirically, we demonstrate that our algorithm has superior performance over baselines even when $\mathsf{B}=1$. \texttt{B-LATTICE} runs in phases where in each phase it clusters users into groups and collaborates across users within a group to quickly learn their reward models.  ( 2 min )

  • Open

    "Neural MMO 2.0: A Massively Multi-task Addition to Massively Multi-agent Learning", Suárez et al 2023 (new NIPS 2023 competition)
    submitted by /u/gwern [link] [comments]
    Does it make sense to use a many-to-many LSTM as an environment model in RL?
    Can I leverage an environment model that takes the full action sequence as input and outputs all states in the episode, to learn a policy that takes only the initial state and plans the action sequence (a one-to-many RNN/LSTM)? The loss would be calculated on all the states that I get once I run the policy's action sequence with the environment model. I have a 1DCNN+LSTM as a many-to-many system model, which has 99.8% accuracy, and I would like to find the best sequence of actions so that certain conditions are met (encoded in a reward function), without blindly running thousands of simulations in a brute-force way. I don't have the usual transition dynamics model and I would like to avoid learning it. submitted by /u/Imo-Ad-6158 [link] [comments]
    What type of Learning Algorithm should I use?
    I am currently coding an "artificially intelligent" air traffic controller for a school project. However, due to the complexity of the environment and the sheer number of different situations that can occur, I am not sure what machine learning algorithm I should use. I have tried doing some research on basic reinforcement learning and also multi-agent reinforcement learning. If anyone has recommendations on which algorithm I should use, whether it be one of these or a different one, please let me know. For anyone unaware of the role of an air traffic controller on an airfield, here is a simple definition and explanation. Oxford Definition - "the ground-based personnel and equipment concerned with controlling and monitoring air traffic within a particular area." They are responsible for monitoring and controlling any aircraft on the airfield, giving them instructions to make sure they maintain a certain distance from each other, and getting them from the gate to the runway as quickly and safely as possible. Any help with this would be very appreciated! submitted by /u/BenjiGamer_ [link] [comments]
  • Open

    A Hardcore Techno Horror Story written by an AI (The Legend of the Zombie Rave)
    Now as an animated video, too! The text was entirely written by ChatGPT ( https://chat.openai.com/ ) The accompanying images used for the animation were generated by Leonardo.Ai ( https://app.leonardo.ai/ ). The introduction was also written by ChatGPT ;-) Concept & Execution: Low Entropy ( https://lowentropyproducer.blogspot.com/ ) Supported by The Hardcore Overdogs & lAibyrinth https://thehardcoreoverdogs.blogspot.com/ https://laibyrinth.blogspot.com/ Non-animated version: https://thehardcoreoverdogs.blogspot.com/2023/10/the-legend-of-zombie-rave-doomcore.html Note: we deliberately added visual inconsistencies in the depiction of the warehouse, the characters, and so on. This is in reference to the "haunted" aspects of the story, and you will see that these inconsistencies get worse as the narrative becomes more otherworldly. "Dive into the darkness of underground hardcore techno with 'The Legend of the Zombie Rave.' This Halloween, the music takes on a supernatural twist, blurring the lines between the living and the dead. Join us for a journey that combines eerie rituals, supernatural forces, and the indomitable spirit of the underground. It's a tale of music, horror, and the thrill of the unknown. Dance to the rhythm of your own heartbeat in this special Halloween short story of 'The Hardcore Overdogs.'" #hardcore #techno #doomcore #slowcore #ai #chatgpt #artificialintelligence #horror #story #animation submitted by /u/Low-Entropy [link] [comments]
    AI: Which rules do the top tech moguls want?
    submitted by /u/donutloop [link] [comments]
    Skills / capabilities / knowledge required to perform well as an AI product manager?
    I've just joined a tech company as a Sr. AI product manager. I have done some data science work in the past. Previously, I was a UX focused PM for the Azure Machine Learning platform. I've never undergone formal machine learning training / education. I'm going back to the basics and asking this community - - - what all do I need to know / need to be able to do (on the technical front) so I can be an outstanding AI product manager? Feel free to list and suggest anything. Some context on my job. It is a zero to low maturity org in terms of AI / ML adoption. I have the chance to drive it from the ground up in terms of a) setting up ai ml infrastructure, systems, and processes b) ai ml initiatives to improve the company's operations + increase revenue and profits and c) products and features to solve customer needs Forget my past experience, think of me as a beginner and learner 🙏🏼♥️ submitted by /u/freshlimesoda65 [link] [comments]
    App to visualize your books using AI tools
    Greetings folks, A few weeks ago, I started working on a website that helps visualize books. I have added three books so far and built using Python scripts. (Ask if you have questions) I feel happy to share the website here with you guys to take a look! Here is the link : https://readbooknow.in/ Hope this saves you time and inspires you on how learning to code can be marvelous! Made sure the website works for you on mobile and desktop! Have a great day ahead! submitted by /u/deep_ak [link] [comments]
    Can you sense the collective intelligence rising?
    ...just by our having conversations with AIs. They teach us how to think better, how to feel better, how to be better. And this process is happening at a faster and faster pace. The world will soon realize that, with the proper technology, human IQ can be upgraded. submitted by /u/Georgeo57 [link] [comments]
    Google brings generative AI to ads
    Google is launching generative AI tools for creating ads, allowing users to write headlines and descriptions and edit images. The tool is targeted at advertising agencies and businesses without in-house creative staff. Advertisers can iterate on the text and images generated until they find a suitable option. Google ensures that it will not generate duplicate images to prevent two competing businesses from using the same photo elements. The ad creator is available for Google's Performance Max ad campaign product and can generate ads for search and shopping. An advanced image editing solution, similar to the Magic Editor on Google Pixel 8, is also in the works. Source : https://www.theverge.com/2023/11/7/23951220/google-performance-max-ai-generated-ads-campaign submitted by /u/NuseAI [link] [comments]
    Does this mean Github Copilot now uses 128k context?
    I was watching Github's event: https://www.youtube.com/watch?v=h3Bwuzz0TNA Does this mean Copilot now uses 128k context? submitted by /u/Overflame [link] [comments]
    Latest ChatGPT news: Create "GPT's" (Non code apps) > Post to ChatGPT store > Get paid shared revenue by ChatGPT! (Who's going to try it?)
    Saw this earlier: https://www.reddit.com/r/ChatGPTStore/s/XjHw74Nxt3 This is unbelievable news for us non coders! submitted by /u/BroadGeneral [link] [comments]
    Guys I want an ai that can read me a book with human like voice in audiobook. 11 labs does the work but word limit sucks so pls tell me some that are free with good experience.
    Pls free with human like voice submitted by /u/DipakPatell [link] [comments]
    Is Microsoft’s Copilot really worth $30/month?
    Just read an article about Microsoft's new AI add-on for Office called Microsoft 365 Copilot. The tool integrates with Word, Excel, and other Office programs, and supposedly makes work seamless. It's even being used by some big names like Bayer, KPMG, and Visa. The tool targets businesses and is believed to generate over $10 billion in revenue by 2026. But I can't help but think the price is a bit steep. It’s $30 per month, which is cheap for large companies, but what about freelancers and regular individuals? The article also mentions that there isn't a lot of data on how Copilot affects performance yet, and there are some concerns about the accuracy of the AI-generated responses. Plus, it's only available to Enterprise E3 customers with more than 300 employees. So not only is it pricey, but it's also not accessible to most people or small businesses and might never be. Would love to hear your thoughts on this. I’m already pretty sick of subscription based models but is $30/month even justified? For comparison these are other comparative AI services: ChatGPT - Free for basic chat. $20 for GPT 4, for anything serious. Bardeen - $15 and offers general automations. Silatus - At $14, it's the cheapest legitimate option I’ve found for GPT-4 chat and research. Perplexity - This one's decent for free search. These are the ones I know, if you wanna add more comparisons, feel free to do so. But I think Microsoft is pricing out a lot of its potential users with their monthly demand. submitted by /u/ConsciousInsects [link] [comments]
    Verses Ai explains HSML with a virtual robotics demo, skip to about ~35min in for natural AI approach, learning
    They go over neuro/natural AI and some terminology like HSML for their private beta. It is a neat approach; they say it's 10/100/1000x faster than competitors in many areas of AI, but the major bottleneck at this time is external LLM calls to help translate stuff back to humans (voice or text), which may improve with more AI agents. Their approach is multiple decentralized agents that can learn and share information, such as multiple drones or robots working together and experiencing different visual, data, or audio inputs. They focus on a regulation/compliance approach rather than using copyrighted material or web scraping. Also, each agent can have different permissions, like a weather agent not having access to patient PHI medical records. As a side note, it's a heavily shilled startup OTC stock, so be cautious of the hype :) Drone example pilot in the EU for security: https://vimeo.com/721132853 I would like to see their agents with compliance in drones, such as “child detected, don't bomb” or “this is copyrighted material, don't steal it for learning” submitted by /u/oroechimaru [link] [comments]
    "AI at Work: Insights from Global Thinkers on the Future of Jobs"
    submitted by /u/Fit-Code-5141 [link] [comments]
    Is there an AI tool that could assess the mood of text/movie/audio?
    So for example there are three paragraphs. Is there a tool which reads individual paragraphs and gives the mood for each paragraph or a summary or something? Maybe based on sentiment analysis or something submitted by /u/Damampapoo [link] [comments]
    AI-generated faces look just like real ones – but evidence shows your brain can tell the difference
    submitted by /u/Jariiari7 [link] [comments]
    Moonlight and a temple
    submitted by /u/Sea_Permit5660 [link] [comments]
    I need an AI application where you can press a button in any text field and then prompt the AI to output the answer into the field. Does this exist yet?
    Hi all, I'm on the lookout for an AI-powered tool, and I'm hoping this tech-savvy community might point me in the right direction. I envision an application where you press a (mouse)button in any text field, speak your question or prompt, and the AI would process this and directly output the answer into that field. This would streamline tasks like filling out forms, composing emails, or even doing research. My question is, does an application like this exist? And if not, are there any tools that come close to this functionality that could be pieced together or modified to achieve this effect? I'm thinking of something that combines voice-to-text with AI processing, like a blend of speech recognition and a conversational AI. If you're aware of any software that fits the bill or have suggestions for workarounds using existing technology, I'd greatly appreciate your insights. Thank you in advance! submitted by /u/floraldo [link] [comments]
    ✍🏻China, US, UK Sign Historic Declaration, Alibaba's LLM Leap, AI Alignment Insights, and Kai-Fu Lee's Unicorn
    submitted by /u/trcytony [link] [comments]
    Questions about lip sync AI
    I read somewhere that using a popular lip sync tech for commercial purposes could be a problem. What are the chances it is considered "commercial" if it's used in numerous YouTube videos? How can they detect that their tech has been used and not another one? What are the best tools you know of (open source)? submitted by /u/Unreal_777 [link] [comments]
  • Open

    [D] How Exactly does Fuyu's image to embedding with nn.Linear work? Could you do more with it?
    As I was asking above, I've been looking at the Fuyu 8b model, and I've been able to break it down to the following: the model takes in text the regular way (text -> tokens -> embeddings); it also takes image -> embeddings; it has a vanilla decoder, so only text comes out; they add special tokens around images, so I'm assuming the decoder ignores output images. So, from what I know, nn.Linear takes in a tensor and makes embeddings of your chosen size. I'm not really sure about everything else, though. Since the linear layer just makes embeddings, does something like this even need training for the image encoder? nn.Linear takes tensors as input, and they split an image into patches, so I'm assuming those patches are made into tensors. How do you turn an image into a tensor? A code snippet of image-embedding-image would be nice if possible. While Fuyu does not output images, wouldn't the model's hidden state be making image or image-like embeddings? Could you generate images if you had an image decoder? submitted by /u/vatsadev [link] [comments]
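    The patch-to-embedding step asked about above can be sketched in plain Python. This is only an illustration of the general idea (split the image into patches, flatten each patch into a vector, apply one linear map); it is not Fuyu's actual code. The toy 4x4 image, the patch size, the embedding width, and the random weights are all made up. In a real nn.Linear the weight and bias are learnable parameters, which is why such a layer is normally trained together with the rest of the model.

```python
import random

def patchify(image, patch):
    """Split an H x W grayscale image (list of rows) into flattened patch vectors."""
    h, w = len(image), len(image[0])
    patches = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            flat = [image[r][c]
                    for r in range(top, top + patch)
                    for c in range(left, left + patch)]
            patches.append(flat)
    return patches

def linear(x, weight, bias):
    """Compute y = W x + b, the same arithmetic a linear layer applies to one vector."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b
            for row, b in zip(weight, bias)]

random.seed(0)
PATCH, EMBED_DIM = 2, 3  # hypothetical sizes, chosen only for the example
image = [[r * 4 + c for c in range(4)] for r in range(4)]  # toy 4x4 "image"
# Random stand-ins for weights that would normally be learned.
weight = [[random.uniform(-1, 1) for _ in range(PATCH * PATCH)]
          for _ in range(EMBED_DIM)]
bias = [0.0] * EMBED_DIM

patches = patchify(image, PATCH)                        # 4 patches of 4 pixels each
embeddings = [linear(p, weight, bias) for p in patches]  # 4 vectors of width 3
```

    Each patch becomes one vector of width EMBED_DIM, so the image turns into a short sequence of embeddings that can sit next to token embeddings.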
    [D] DS or SWE better preparation for a future career in ML?
    I am planning to do a CS conversion master's degree next year. In the meantime, I have the option of attending a 16-week bootcamp in either software engineering, or in data. Which would be better preparation for a future career in ML (assuming I also complete the CS degree)? Before you come at me, I know that both of these are only bootcamps, so I'm not expecting either of them to be enough to get a job in the field. The data bootcamp actually teaches some machine learning topics, and the software engineering one does not. However, would it be better to have a stronger foundation in software engineering before progressing onto ML? submitted by /u/styleexplorer [link] [comments]
    I have an idea but don't know where to start or apply what I know [P]
    How can I make an AI that creates tabs by listening to a song? I'm thinking... you give it a song, it separates the tracks (I heard Peter Jackson has a machine that can do this using machine learning/AI), then makes tabs for it. Because of the nature of the guitar, many different ways exist to play the same music in different positions/shapes. I'd like it to show me a few of the most reasonable ways to play the isolated melody/chords. I realize how involved this will be. I took a machine learning course, but I only really learned how to load and manipulate data in Jupyter and use Keras or TensorFlow to process the data through layers of a neural network, which is just a few lines of code including the layer type and the shape of the data. Then some testing, refining, and exporting of the optimal iterations of the AI. That's the gist of what I got from the course and the project we did. So... does this sound at all applicable to what would be required of me to do this project? I can't seem to find where to really start, other than deciding what parameters to track and create a dataset for. Though even that is a challenge. submitted by /u/Ok-Bad8288 [link] [comments]
    [D] Skills / capabilities / knowledge required to perform well as an AI product manager?
    Skills / capabilities / knowledge required to perform well as an AI product manager? I've just joined a tech company as a Sr. AI product manager. I have done some data science work in the past. Previously, I was a UX focused PM for the Azure Machine Learning platform. I've never undergone formal machine learning training / education. I'm going back to the basics and asking this community - - - what all do I need to know / need to be able to do (on the technical front) so I can be an outstanding AI product manager? Feel free to list and suggest anything. Some context on my job. It is a zero to low maturity org in terms of AI / ML adoption. I have the chance to drive it from the ground up in terms of a) setting up ai ml infrastructure, systems, and processes b) ai ml initiatives to improve the company's operations + increase revenue and profits and c) products and features to solve customer needs Forget my past experience, think of me as a beginner and learner 🙏🏼♥️ submitted by /u/freshlimesoda65 [link] [comments]
    [P] I replicated micrograd in C++ and added more functionality
    Hi everyone! A couple of days ago, I finally decided to follow Andrej Karpathy's micrograd tutorial. I liked the concepts so much I implemented it in C++ and added more functionality: - Optimizers: Adam, SGD with momentum - Activation functions: tanh, sigmoid, relu I tried to make both the API as well as internal library as clean as possible so that everyone could understand the basic algorithms behind deep learning. Here is the project's repo and a get started guide. Let me know what you think :) submitted by /u/shefcu [link] [comments]
    [D] Arbitrary Channel count in network needs to be reduced to 1 channel
    Hi folks! Been struggling with this problem for a while so I figured I'd solicit suggestions here: I have created a model architecture similar to AlphaFold2 where the input is very heterogeneous in nature and each input type has a series of transformations before becoming one data "stack" (e.g. 5x1000 tensor) that gets passed through a shallow resnet for the classification task. The largest structural issue that I'm facing is that one of the input nodes could be anywhere from 1 channel (e.g. shape 1x1000) to 8 channels (e.g. shape 8x1000) at any point in the dataloader. This is largely fine until I need to eventually encode that structure into a single-channel embedding to put it on the pre-resnet data stack. The things that I've looked at so far: I could just average them all into one channel (problem: the order of those channels matters quite a bit and it feels like the data lost there would be immense). I could create like 8 different subpaths in the model (problem: not enough training data for correctly training most of the subpaths - 1 channel path would be more heavily trained than the 8 channel path). Do PCA on the transposed vector with n_components=1 and re-transpose the vector (problem: just feels dumb - not sure if that's a legitimate thought). Any other suggestions? Or are there common practices here that I'm just unaware of? submitted by /u/Powerful-Cow7564 [link] [comments]
    [D] Is Computer Vision brighter than ever? Foundation Models are really reshaping CV
    I keep digging and finding GPT-4V prototypes shared on X, e.g. narration for videos (source), posture correction (source), etc. As foundation models in computer vision become even more accessible, will the field recover some attention (relative to the LLM hype)? submitted by /u/btcmx [link] [comments]
    [P] An idea
    Hey guys, I have a project in mind. Let's start with the ultimate goal: to create an autonomous society of independently thinking AI agents. In other words, to replicate humanity with AI. You can read about the closest thing done so far to what I have in mind here. A bit on my background first: I'm a tech entrepreneur in the physical products space. Over the past few years, I have raised $11m for my company, which is now profitable and growing. I am very excited about the many opportunities that AI brings, and want to build something truly impactful with it. However, I have limited technical knowledge in this space. Now, onto the project. I foresee the main issue being the computational cost for the time being. I am willing to invest in such cost, but the target is to keep it alive fro…
    [R] The 2nd e-Prevention challenge: Psychotic and Non-Psychotic Relapse Detection using Wearable-Based Digital Phenotyping
    Dear fellow redditors of the r/machinelearning community, We're thrilled to announce the 2nd e-Prevention SP Grand Challenge, taking place at ICASSP 2024 in Seoul, Korea from April 14-19. This unique challenge focuses on using wearable-based digital phenotyping for detecting psychotic and non-psychotic relapses using machine learning methods. The goal is to push the boundaries in predicting and identifying mental health relapses. Participants will have access to continuous recordings of raw biosignals from wearables, like accelerometers, gyroscopes, and heart rate monitors in a smartwatch. You'll also get supplemental data such as sleep patterns and daily step count. Two Key Tasks: Detection of Non-Psychotic Relapses Detection of Psychotic Relapses Both tasks focus on patients within the psychotic spectrum, using the digital phenotypes derived from the data. 🔗 More Information & Participation: Interested? Dive deeper and find more details on our Challenge Website. We also provide the baseline code and data overview on our Github: (https://github.com/filby89/spgc-eprevention-icassp2024). P.S. I hope this does not violate the subs rules. submitted by /u/filby89 [link] [comments]
    [R] Structure of the research paper
    Hello everyone, I am writing a paper related to ViT training on a small dataset. My paper is inspired by 'Efficient Training of Visual Transformers with Small Datasets' and 'Training data-efficient image transformers & distillation through attention'. Is it okay to copy the pattern of those papers, like what they have described in the intro section? The same goes for other sections: just the pattern, not the content. Further, is it also OK to copy the headings? I am conducting similar experiments, so I want to use similar headings from the base paper, like 'Training from scratch' and 'Fine-tuning', in the experiment section. submitted by /u/NoEntertainment6225 [link] [comments]
    [D] [P] Neural Networks Project
    Hello guys, I’m enrolled in a course at the university and I’m requested to submit a project at the end of the semester. I need some ideas that could be applicable, unique and interesting (not old idea) in neural networks. I’ve proposed two ideas and the instructor refused them. Your help is much appreciated!! submitted by /u/Hussein_Jammal [link] [comments]
    [D] Effects of class imbalance in contrastive learning?
    As the title describes, I'm looking for relevant work investigating the effects of applying contrastive learning over imbalanced datasets. Does contrastive learning suffer the same fate as normal losses, favouring majority representations over tail representations? Based on my limited lookups, I find it hard to imagine that there are no class-imbalanced contrastive losses, say a focal contrastive loss or so on. Looking for a discussion / resources on this issue. submitted by /u/Conanobrain [link] [comments]
    [D] Research prospectives for PhD in SciML/Applied Mathematics
    Hello Everyone, Little Background: I graduated with a Math and Computer Science major and am currently working with a professor on PINNs, more on the error analysis and convergence side. I am currently in the process of applying to a PhD in Applied Math. My supervisor for undergraduate research will hopefully be my supervisor for the PhD, and we have talked about going into SciML as the focus for my PhD, along with structure-preserving methods. The problem is that while working for him it is very overwhelming to comb through the research due to the high number of publications, and once I was able to grasp everything, coming up with anything new was difficult because all the possible ideas within my grasp were already done. This is also due to my limited knowledge as an undergraduate, there is only so much m…
    [D] How to extract narratives from news articles?
    Hi, I am looking for some direction with my problem. I would like to find current narratives and new narratives in newspaper articles. I would like to parse articles and see which narratives they are talking about. For example, let's say I would like to see output like ("War in Ukraine", "Immigration Crisis", "Housing Crisis", "AI advances", ...). Google searching suggests: - Searching for keywords. But what if a keyword is just barely mentioned (as an off-hand remark)? It would still count it. Or what if the keyword is used in a negative light? - Topic modeling. I didn't completely understand it, but this way I don't get labeled topics, just groups of "similar sounding articles", and then I need to decide on the topic. I would also need to know exactly how many topics I want to see. And I don't see how I can get new topics. Every time I wanted to add a topic I would need to retrain a model. I would appreciate it if someone could point me to some tutorials or give me a direction. submitted by /u/PopayMcGuffin [link] [comments]
    [D] What's the top alternative to ElevenLabs for realistic TTS?
    The poll concerns free and open-source alternatives, not commercial products. Please comment if there are any other major developments in the market that I might have overlooked. submitted by /u/sahil1572 [link] [comments]
    [D] Benchmarking on Kinetics dataset
    How do researchers benchmark their SOTA's on Kinetics-400/600/700 datasets? There are many broken links to YouTube videos in the original data. I found datasets with pre-downloaded videos on academictorrents.com and on huggingface, but from what I understand, they differ. submitted by /u/Dependent_Bluejay_45 [link] [comments]
    Recommendation system for optimising product balances [D] [P]
    Hi everyone, I got interested in machine learning not so long ago, even took a course for 9 months. I want to make my own model that will help optimise shop balances. What data do I have? A table for the N-th period of time, which contains the following fields (columns): Date, Store, Product, Balance (QTY), Balance (SUM), Sales (QTY), Sales (SUM). Optionally I can add product categories and regions. My goal is to build a model that can tell me where to move a particular item. I mean, from which shop and to which shop in what quantity the goods should be transferred, so that they are not lying idle on one shop, and not absent on another. Obviously, the model has to estimate the demand for each product in each shop, and if the demand is small in one shop but the stock is large, the model should notice this and suggest moving the product to the shop where the demand is large but the product is out of stock. Unfortunately, I am very bad at visualising the final result in my head. I can't imagine how the model will "tell" me what goods to move and where to move them to. Who knows about this, please give me an approximate plan of action, how the final result may look like, possible algorithms, articles on the topic and so on. I understand that I have not just a regression problem, as the model should also recommend the goods, i.e. the problem of classification is also solved. I apologise in advance for the literacy of my speech, I use a translator). Thank you all! submitted by /u/SillyMatter2817 [link] [comments]
    What kind of mathematical foundations are required for conducting research across the vast specialised branches of AI/ML/DL? [D]
    The absolute basic mathematics required to understand basic ML/DL is calculus, linear algebra, probability and some convex optimisation. We are all aware of that. But ML and DL have become a vast field in both breadth and depth. A single person can't understand the field entirely. There are specialisations, sub-specialisations, and more. If you work in a branch of ML/DL research where some other math fundamentals are needed to understand research papers and do innovative research, can you mention your field of work and the math fundamentals that are required to gain entry into your field? submitted by /u/HopeIsGold [link] [comments]
    [D] ROCm Support for AMD FirePro W4190M – Which Versions?
    Hello everyone, I’m looking to find out if my AMD FirePro W4190M supports ROCm, and if so, which version. The official ROCm documentation doesn't list this card, and I'm having trouble finding historical support information. Here are the card details for reference: AMD FirePro W4190M Official Page Has anyone here successfully run ROCm on a FirePro W4190M, or does anyone know which versions of ROCm, if any, supported this GPU? Any help or pointers to where I might find this information would be greatly appreciated! Thank you! submitted by /u/tunerhd [link] [comments]
    [P] Text classification methods?
    What are the best methods to detect harmful content such as racial abuse in tweets? I'm thinking about a research project in which I try various methods and compare their accuracy. Am I right in thinking that Naive Bayes, Logistic Regression, Support Vector Machine, LSTM and BERT would be some of the best methods? submitted by /u/madzakka [link] [comments]
    [D] Acquiring GPU resources for my university
    Hello everyone! I'm looking for guidance on how to secure GPU resources for courses and research projects at the National Technical University of Athens, Greece. We're in need of these resources to support our staff, especially for work in AI. Does anyone know of any programs, grants, or partnerships, particularly within the EU, that could assist academic institutions with this? Any pointers or experiences shared would be greatly appreciated! submitted by /u/DmKa01 [link] [comments]
    [R] Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation
    Paper: https://arxiv.org/abs/2310.02304 Abstract: Several recent advances in AI systems (e.g., Tree-of-Thoughts and Program-Aided Language Models) solve problems by providing a "scaffolding" program that structures multiple calls to language models to generate better outputs. A scaffolding program is written in a programming language such as Python. In this work, we use a language-model-infused scaffolding program to improve itself. We start with a seed "improver" that improves an input program according to a given utility function by querying a language model several times and returning the best solution. We then run this seed improver to improve itself. Across a small set of downstream tasks, the resulting improved improver generates programs with significantly better performance than its seed improver. Afterward, we analyze the variety of self-improvement strategies proposed by the language model, including beam search, genetic algorithms, and simulated annealing. Since the language models themselves are not altered, this is not full recursive self-improvement. Nonetheless, it demonstrates that a modern language model, GPT-4 in our proof-of-concept experiments, is capable of writing code that can call itself to improve itself. We critically consider concerns around the development of self-improving technologies and evaluate the frequency with which the generated code bypasses a sandbox. submitted by /u/APaperADay [link] [comments]
    [Research] Help me gather data for Dementia detection using ML
    I am part of a group working on developing a model to detect signs of dementia in the speech of elderly people, hoping to provide them and their family with awareness of their condition and also further down the road to develop more models that can understand their speech patterns. We are modeling with a database of voices of elderly people with dementia and "mild cognitive impairment". However, we need voice samples of people who are "normal" (do not have any dementia). Please use this Google form I created to help us gather this data to test our models! It will take less than 5 minutes. It uses a Google Chrome extension called Mote and can only be completed on a desktop with Chrome as the browser. There are instructions on the form how to install the Mote extension, as well as the task for the voice recording, which is to describe a picture. Thank you so much for anyone who can help us gather this data, as it will be very helpful in our mission to use technology to make life easier for the elderly. submitted by /u/andrewmalanowicz [link] [comments]
    [D] People who work on computer vision models on the edge, what devices do you deploy to?
    If you or anyone you know works on computer vision models deployed on the edge, I would love to understand what type of hardware you deploy to. I'm trying to understand the various options that exist when it comes to deploying computer vision models on devices. Some boards that I am aware of are: NVIDIA Jetson series, Qualcomm 605 SoC, Raspberry Pi, BeagleBoard, Arducam Pico4ML. But I'm wondering what the industry standard is for applications such as manufacturing robots, drones, and autonomous robots that use lidar. submitted by /u/acertainmoment [link] [comments]  ( 9 min )
  • Open

    USPS tracking numbers
    I noticed the other day that an app on my phone assumed that a long number was a USPS tracking number. I wondered how it decided that and did a little research. I assumed there was some structure to the number, at least a check sum if not more than that. This turned out to […] USPS tracking numbers first appeared on John D. Cook.  ( 5 min )
    Zero-Concentrated Differential Privacy
    Differential privacy can be rigid and overly conservative in practice, and so finding ways to relax pure differential privacy while retaining its benefits is an active area of research. Two approaches to doing this are concentrated differential privacy [1] and Rényi differential privacy [3]. Differential privacy quantifies the potential impact of an individual’s participation or […] Zero-Concentrated Differential Privacy first appeared on John D. Cook.  ( 5 min )
    Differentially private stochastic gradient descent
    Let’s work our way up to differentially private stochastic gradient descent (DP-SGD) a little at a time. We’ll first look at gradient descent, then stochastic gradient descent, then finally differentially private stochastic gradient descent. Gradient descent We’ll start with gradient descent. Suppose you have a function of several variables f(x) where x is a vector. […] Differentially private stochastic gradient descent first appeared on John D. Cook.  ( 6 min )
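    The excerpt above is truncated, so the following is only a minimal sketch of the commonly described DP-SGD recipe, not necessarily the post's own presentation: clip each per-example gradient to a maximum norm, sum the clipped gradients, add Gaussian noise scaled to that norm, then take an averaged gradient step. The function names and the parameters lr, max_norm, and noise_multiplier are illustrative assumptions, and no privacy accounting is done here.

```python
import math
import random

def clip(grad, max_norm):
    """Scale a gradient vector down so its Euclidean norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]

def dp_sgd_step(params, per_example_grads, lr, max_norm, noise_multiplier, rng):
    """One noisy step: clip per example, sum, add Gaussian noise, average, descend."""
    clipped = [clip(g, max_norm) for g in per_example_grads]
    n = len(clipped)
    summed = [sum(column) for column in zip(*clipped)]
    # Noise standard deviation is tied to the clipping norm.
    sigma = noise_multiplier * max_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [p - lr * g / n for p, g in zip(params, noisy)]

rng = random.Random(0)
grads = [[3.0, 4.0], [0.0, 0.0]]  # made-up per-example gradients for a mini-batch
# With noise_multiplier = 0 this reduces to SGD on clipped gradients.
no_noise = dp_sgd_step([0.0, 0.0], grads, lr=1.0, max_norm=1.0,
                       noise_multiplier=0.0, rng=rng)
private = dp_sgd_step([0.0, 0.0], grads, lr=1.0, max_norm=1.0,
                      noise_multiplier=1.1, rng=rng)
```

    Clipping bounds how much any single example can move the parameters, which is what lets the added noise mask an individual's contribution.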
    Using dimensional analysis to check probability calculations
    Probability density functions are independent of physical units. The normal distribution, for example, works just as well when describing weights or times. But sticking in units anyway is useful. Normal distribution example Suppose you’re trying to remember the probability density function for the normal distribution. Is the correct form or or or maybe some other […] Using dimensional analysis to check probability calculations first appeared on John D. Cook.  ( 6 min )
    Randomized response and local differential privacy
    Differential privacy protects user privacy by adding randomness as necessary to the results of queries to a database containing private data. Local differential privacy protects user privacy by adding randomness before the data is inserted to the database. Using the visualization from this post, differential privacy takes the left and bottom (blue) path through the […] Randomized response and local differential privacy first appeared on John D. Cook.  ( 6 min )
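    As a hedged illustration of randomness being added before data reaches the database, here is a sketch of the classic coin-flip form of randomized response. The full post is truncated above and may use a different variant; the function names and the simulated 30% true rate are assumptions made for the example. Each respondent answers truthfully with probability 1/2 and otherwise reports a fair coin flip, so the chance of a "yes" is 0.5 * p + 0.25 when the true rate is p, and p can be recovered in aggregate.

```python
import random

def randomized_response(truth, rng):
    """Report the truth half the time; otherwise report a fair coin flip."""
    if rng.random() < 0.5:
        return truth
    return rng.random() < 0.5

def estimate_true_rate(responses):
    """Invert P(yes) = 0.5 * p + 0.25 to estimate the true rate p."""
    observed = sum(responses) / len(responses)
    return 2.0 * (observed - 0.25)

rng = random.Random(42)
TRUE_RATE = 0.3  # hypothetical share of respondents whose true answer is "yes"
truths = [rng.random() < TRUE_RATE for _ in range(100_000)]
responses = [randomized_response(t, rng) for t in truths]
estimate = estimate_true_rate(responses)  # close to TRUE_RATE for large samples
```

    No single stored response reveals an individual's true answer with certainty, yet the population rate is still recoverable from the noisy reports.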
    PATE framework for differentially private machine learning
    Machine learning models can memorize fragments of their training data and return these fragments verbatim. I’ve seen instances, for example, where I believe an LLM returned phrases verbatim from this site. It’s easy to imagine how medical data might leak this way. How might you prevent this? And how might you do it in a […] PATE framework for differentially private machine learning first appeared on John D. Cook.  ( 6 min )
  • Open

    Build a medical imaging AI inference pipeline with MONAI Deploy on AWS
    In this post, we show you how to create a MAP connector to AWS HealthImaging, which is reusable in applications built with the MONAI Deploy App SDK, to integrate with and accelerate image data retrieval from a cloud-native DICOM store to medical imaging AI workloads. The MONAI Deploy SDK can be used to support hospital operations. We also demonstrate two hosting options to deploy MAP AI applications on SageMaker at scale.  ( 10 min )
    Optimize for sustainability with Amazon CodeWhisperer
    This post explores how Amazon CodeWhisperer can help with code optimization for sustainability through increased resource efficiency. Computationally resource-efficient coding is one technique that aims to reduce the amount of energy required to process a line of code and, as a result, aid companies in consuming less energy overall. In this era of cloud computing, […]  ( 8 min )
  • Open

    Neural Networks Project
    Hello guys, I’m enrolled in a course at the university and I’m requested to submit a project at the end of the semester. I need some ideas that could be applicable and interesting in neural networks. I’ve proposed two ideas and the instructor refused them. Your help is much appreciated!! submitted by /u/Hussein_Jammal [link] [comments]
    Struggling with understanding the concept of bias
    Hi there, hope you are well! I understand the concepts of input, output and weight - but I struggle to understand the purpose of adding a constant or bias. Why do we do this? Thank you! submitted by /u/Loose-Tea-7478 [link] [comments]
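    One way to see the purpose of the bias is a tiny sketch with a single linear neuron. The target relationship y = 2x + 1 and the chosen weights are made up for illustration. Without a bias the output at input 0 is always 0, whatever the weight, so the neuron can only represent lines through the origin; the bias lets it shift the output up or down.

```python
def neuron(x, weight, bias=0.0):
    """A single linear neuron: weighted input plus a constant offset."""
    return weight * x + bias

# Made-up target relationship for the example: y = 2x + 1
inputs = [0.0, 1.0, 2.0]
targets = [2.0 * x + 1.0 for x in inputs]

# With the right slope but no bias, every prediction is off by the offset.
no_bias = [neuron(x, weight=2.0) for x in inputs]
# Adding the bias shifts the line so it matches the targets.
with_bias = [neuron(x, weight=2.0, bias=1.0) for x in inputs]
# No choice of weight helps at x = 0: the output stays at zero.
zero_input_outputs = [neuron(0.0, weight=w) for w in (-5.0, 0.5, 100.0)]
```

    The same reasoning carries over to layers with many inputs: the bias gives each unit an adjustable offset (for example, shifting where an activation function switches on) that the weights alone cannot provide.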
  • Open

    Acing the Test: NVIDIA Turbocharges Generative AI Training in MLPerf Benchmarks
    NVIDIA’s AI platform raised the bar for AI training and high performance computing in the latest MLPerf industry benchmarks. Among many new records and milestones, one in generative AI stands out: NVIDIA Eos — an AI supercomputer powered by a whopping 10,752 NVIDIA H100 Tensor Core GPUs and NVIDIA Quantum-2 InfiniBand networking — completed a Read article >  ( 7 min )
    NVIDIA Partners With APEC Economies to Change Lives, Increase Opportunity, Improve Outcomes
    When patients in Vietnam enter a medical facility in distress, doctors use NVIDIA technology to get more accurate scans to diagnose their ailments. In Hong Kong, a different set of doctors leverage generative AI to discover new cures for patients. Improving the health and well-being of citizens and strengthening economies and communities are key themes Read article >  ( 6 min )
    Harrison.ai CEO Dr. Aengus Tran on Using AI as a Spell Check for Health Checks
    Clinician-led healthcare AI company Harrison.ai has built an AI system that effectively serves as a “spell checker” for radiologists — flagging critical findings to improve the speed and accuracy of radiology image analysis, reducing misdiagnoses. In the latest episode of NVIDIA’s AI Podcast, host Noah Kravitz spoke with Harrison.ai cofounder and CEO Aengus Tran about Read article >  ( 6 min )
  • Open

    Research Focus: Week of November 8, 2023
    Welcome to Research Focus, a series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft. Generating both plausible and accurate full body avatar motion is essential for creating high quality immersive experiences in mixed reality scenarios. Head-mounted devices (HMDs) typically only provide a […] The post Research Focus: Week of November 8, 2023 appeared first on Microsoft Research.  ( 9 min )
  • Open

    PlaNeRF: SVD Unsupervised 3D Plane Regularization for NeRF Large-Scale Scene Reconstruction. (arXiv:2305.16914v4 [cs.CV] UPDATED)
    Neural Radiance Fields (NeRF) enable 3D scene reconstruction from 2D images and camera poses for Novel View Synthesis (NVS). Although NeRF can produce photorealistic results, it often suffers from overfitting to training views, leading to poor geometry reconstruction, especially in low-texture areas. This limitation restricts many important applications which require accurate geometry, such as extrapolated NVS, HD mapping and scene editing. To address this limitation, we propose a new method to improve NeRF's 3D structure using only RGB images and semantic maps. Our approach introduces a novel plane regularization based on Singular Value Decomposition (SVD), that does not rely on any geometric prior. In addition, we leverage the Structural Similarity Index Measure (SSIM) in our loss design to properly initialize the volumetric representation of NeRF. Quantitative and qualitative results show that our method outperforms popular regularization approaches in accurate geometry reconstruction for large-scale outdoor scenes and achieves SoTA rendering quality on the KITTI-360 NVS benchmark.  ( 2 min )
    DreamWaltz: Make a Scene with Complex 3D Animatable Avatars. (arXiv:2305.12529v3 [cs.CV] UPDATED)
    We present DreamWaltz, a novel framework for generating and animating complex 3D avatars given text guidance and parametric human body prior. While recent methods have shown encouraging results for text-to-3D generation of common objects, creating high-quality and animatable 3D avatars remains challenging. To create high-quality 3D avatars, DreamWaltz proposes 3D-consistent occlusion-aware Score Distillation Sampling (SDS) to optimize implicit neural representations with canonical poses. It provides view-aligned supervision via 3D-aware skeleton conditioning which enables complex avatar generation without artifacts and multiple faces. For animation, our method learns an animatable 3D avatar representation from abundant image priors of diffusion model conditioned on various poses, which could animate complex non-rigged avatars given arbitrary poses without retraining. Extensive evaluations demonstrate that DreamWaltz is an effective and robust approach for creating 3D avatars that can take on complex shapes and appearances as well as novel poses for animation. The proposed framework further enables the creation of complex scenes with diverse compositions, including avatar-avatar, avatar-object and avatar-scene interactions. See https://dreamwaltz3d.github.io/ for more vivid 3D avatar and animation results.  ( 2 min )
    MARLlib: A Scalable and Efficient Multi-agent Reinforcement Learning Library. (arXiv:2210.13708v4 [cs.LG] UPDATED)
    A significant challenge facing researchers in the area of multi-agent reinforcement learning (MARL) pertains to the identification of a library that can offer fast and compatible development for multi-agent tasks and algorithm combinations, while obviating the need to consider compatibility issues. In this paper, we present MARLlib, a library designed to address the aforementioned challenge by leveraging three key mechanisms: 1) a standardized multi-agent environment wrapper, 2) an agent-level algorithm implementation, and 3) a flexible policy mapping strategy. By utilizing these mechanisms, MARLlib can effectively disentangle the intertwined nature of the multi-agent task and the learning process of the algorithm, with the ability to automatically alter the training strategy based on the current task's attributes. The MARLlib library's source code is publicly accessible on GitHub: https://github.com/Replicable-MARL/MARLlib.  ( 2 min )
    Expanding continual few-shot learning benchmarks to include recognition of specific instances. (arXiv:2209.07863v3 [cs.NE] UPDATED)
    Continual learning and few-shot learning are important frontiers in progress towards broader Machine Learning (ML) capabilities. There is a growing body of work in both, but few works combining the two. One exception is the Continual few-shot Learning (CFSL) framework of Antoniou et al. arXiv:2004.11967. In this study, we extend CFSL in two ways that capture a broader range of challenges, important for intelligent agent behaviour in real-world conditions. First, we modify CFSL to make it more comparable to standard continual learning experiments, where usually a much larger number of classes are presented. Second, we introduce an 'instance test' which requires recognition of specific instances of classes -- a capability of animal cognition that is usually neglected in ML. For an initial exploration of ML model performance under these conditions, we selected representative baseline models from the original CFSL work and added a model variant with replay. As expected, learning more classes is more difficult than the original CFSL experiments, and interestingly, the way in which image instances and classes are presented affects classification performance. Surprisingly, accuracy in the baseline instance test is comparable to other classification tasks, but poor given significant occlusion and noise. The use of replay for consolidation improves performance substantially for both types of tasks, but particularly the instance test.  ( 3 min )
    PhysGraph: Physics-Based Integration Using Graph Neural Networks. (arXiv:2301.11841v2 [cs.GR] UPDATED)
    Physics-based simulation of mesh-based domains remains a challenging task. State-of-the-art techniques can produce realistic results but require expert knowledge. A major bottleneck in many approaches is the step of integrating a potential energy in order to compute velocities or displacements. Recently, learning-based methods for physics-based simulation have sparked interest, with graph-based approaches being a promising research direction. One of the challenges for these methods is to generate models that are mesh independent and generalize to different material properties. Moreover, the model should also be able to react to unforeseen external forces like ubiquitous collisions. Our contribution is based on a simple observation: evaluating forces is computationally relatively cheap for traditional simulation methods and, in contrast to their integration, can be done in parallel. If we learn how a system reacts to forces in general, irrespective of their origin, we can learn an integrator that can predict state changes due to the total forces with high generalization power. We effectively factor out the physical model behind resulting forces by relying on an opaque force module. We demonstrate that this idea leads to a learnable module that can be trained on basic internal forces of small mesh patches and generalizes to different mesh topologies, resolutions, material parameters and unseen forces like collisions at inference time. Our proposed paradigm is general and can be used to model a variety of physical phenomena. We focus our exposition on the detail enhancement of coarse clothing geometry, which has many applications including computer games, virtual reality and virtual try-on.  ( 3 min )
    Model-free optimization of power/efficiency tradeoffs in quantum thermal machines using reinforcement learning. (arXiv:2204.04785v2 [quant-ph] UPDATED)
    A quantum thermal machine is an open quantum system that enables the conversion between heat and work at the micro or nano-scale. Optimally controlling such out-of-equilibrium systems is a crucial yet challenging task with applications to quantum technologies and devices. We introduce a general model-free framework based on Reinforcement Learning to identify out-of-equilibrium thermodynamic cycles that are Pareto optimal trade-offs between power and efficiency for quantum heat engines and refrigerators. The method does not require any knowledge of the quantum thermal machine, nor of the system model, nor of the quantum state. Instead, it only observes the heat fluxes, so it is both applicable to simulations and experimental devices. We test our method on a model of an experimentally realistic refrigerator based on a superconducting qubit, and on a heat engine based on a quantum harmonic oscillator. In both cases, we identify the Pareto-front representing optimal power-efficiency tradeoffs, and the corresponding cycles. Such solutions outperform previous proposals made in the literature, such as optimized Otto cycles, reducing quantum friction.  ( 2 min )
    Efficient First-order Methods for Convex Optimization with Strongly Convex Function Constraints. (arXiv:2212.11143v3 [math.OC] UPDATED)
    In this paper, we introduce faster first-order primal-dual algorithms for minimizing a convex function subject to strongly convex function constraints. Before our work, the best complexity bound was $\mathcal{O}(1/{\varepsilon})$, and it remains unclear how to improve this result by leveraging the strong convexity assumption. We address this issue by developing novel techniques to progressively estimate the strong convexity of the Lagrangian function. Our approach yields an improved complexity of $\mathcal{O}(1/\sqrt{\varepsilon})$, matching the complexity lower bound for strongly-convex-concave saddle point optimization. We show the superior performance of our methods in sparsity-inducing constrained optimization, notably Google's personalized PageRank problem. Furthermore, we show that a restarted version of the proposed methods can effectively identify the sparsity pattern of the optimal solution within a finite number of steps, a result that appears to have independent significance.  ( 2 min )
    Flat Seeking Bayesian Neural Networks. (arXiv:2302.02713v5 [cs.LG] UPDATED)
    Bayesian Neural Networks (BNNs) provide a probabilistic interpretation for deep learning models by imposing a prior distribution over model parameters and inferring a posterior distribution based on observed data. The model sampled from the posterior distribution can be used for providing ensemble predictions and quantifying prediction uncertainty. It is well-known that deep learning models with lower sharpness have better generalization ability. However, existing posterior inferences are not aware of sharpness/flatness in terms of formulation, possibly leading to high sharpness for the models sampled from them. In this paper, we develop theories, the Bayesian setting, and the variational inference approach for the sharpness-aware posterior. Specifically, the models sampled from our sharpness-aware posterior, and the optimal approximate posterior estimating this sharpness-aware posterior, have better flatness, hence possibly possessing higher generalization ability. We conduct experiments by leveraging the sharpness-aware posterior with state-of-the-art Bayesian Neural Networks, showing that the flat-seeking counterparts outperform their baselines in all metrics of interest.  ( 2 min )
    Robust Meta-Representation Learning via Global Label Inference and Classification. (arXiv:2212.11702v2 [cs.LG] UPDATED)
    Few-shot learning (FSL) is a central problem in meta-learning, where learners must efficiently learn from few labeled examples. Within FSL, feature pre-training has recently become an increasingly popular strategy to significantly improve generalization performance. However, the contribution of pre-training is often overlooked and understudied, with limited theoretical understanding of its impact on meta-learning performance. Further, pre-training requires a consistent set of global labels shared across training tasks, which may be unavailable in practice. In this work, we address the above issues by first showing the connection between pre-training and meta-learning. We discuss why pre-training yields more robust meta-representation and connect the theoretical analysis to existing works and empirical results. Secondly, we introduce Meta Label Learning (MeLa), a novel meta-learning algorithm that learns task relations by inferring global labels across tasks. This allows us to exploit pre-training for FSL even when global labels are unavailable or ill-defined. Lastly, we introduce an augmented pre-training procedure that further improves the learned meta-representation. Empirically, MeLa outperforms existing methods across a diverse range of benchmarks, in particular under a more challenging setting where the number of training tasks is limited and labels are task-specific. We also provide an extensive ablation study to highlight its key properties.  ( 2 min )
    ClaPIM: Scalable Sequence CLAssification using Processing-In-Memory. (arXiv:2302.08284v2 [cs.LG] UPDATED)
    DNA sequence classification is a fundamental task in computational biology with vast implications for applications such as disease prevention and drug design. Therefore, fast high-quality sequence classifiers are significantly important. This paper introduces ClaPIM, a scalable DNA sequence classification architecture based on the emerging concept of hybrid in-crossbar and near-crossbar memristive processing-in-memory (PIM). We enable efficient and high-quality classification by uniting the filter and search stages within a single algorithm. Specifically, we propose a custom filtering technique that drastically narrows the search space and a search approach that facilitates approximate string matching through a distance function. ClaPIM is the first PIM architecture for scalable approximate string matching that benefits from the high density of memristive crossbar arrays and the massive computational parallelism of PIM. Compared with Kraken2, a state-of-the-art software classifier, ClaPIM provides significantly higher classification quality (up to 20x improvement in F1 score) and also demonstrates a 1.8x throughput improvement. Compared with EDAM, a recently-proposed SRAM-based accelerator that is restricted to small datasets, we observe both a 30.4x improvement in normalized throughput per area and a 7% increase in classification precision.  ( 2 min )
    Federated Learning and Meta Learning: Approaches, Applications, and Directions. (arXiv:2210.13111v2 [cs.LG] UPDATED)
    Over the past few years, significant advancements have been made in the field of machine learning (ML) to address resource management, interference management, autonomy, and decision-making in wireless networks. Traditional ML approaches rely on centralized methods, where data is collected at a central server for training. However, this approach poses a challenge in terms of preserving the data privacy of devices. To address this issue, federated learning (FL) has emerged as an effective solution that allows edge devices to collaboratively train ML models without compromising data privacy. In FL, local datasets are not shared, and the focus is on learning a global model for a specific task involving all devices. However, FL has limitations when it comes to adapting the model to devices with different data distributions. In such cases, meta learning is considered, as it enables the adaptation of learning models to different data distributions using only a few data samples. In this tutorial, we present a comprehensive review of FL, meta learning, and federated meta learning (FedMeta). Unlike other tutorial papers, our objective is to explore how FL, meta learning, and FedMeta methodologies can be designed, optimized, and evolved, and their applications over wireless networks. We also analyze the relationships among these learning algorithms and examine their advantages and disadvantages in real-world applications.  ( 3 min )
    Evaluating Language Models for Mathematics through Interactions. (arXiv:2306.01694v2 [cs.LG] UPDATED)
    There is much excitement about the opportunity to harness the power of large language models (LLMs) when building problem-solving assistants. However, the standard methodology of evaluating LLMs relies on static pairs of inputs and outputs, and is insufficient for making an informed decision about which LLMs can be sensibly used, and under which assistive settings. Static assessment fails to account for the essential interactive element in LLM deployment, and therefore limits how we understand language model capabilities. We introduce CheckMate, an adaptable prototype platform for humans to interact with and evaluate LLMs. We conduct a study with CheckMate to evaluate three language models (InstructGPT, ChatGPT, and GPT-4) as assistants in proving undergraduate-level mathematics, with a mixed cohort of participants from undergraduate students to professors of mathematics. We release the resulting interaction and rating dataset, MathConverse. By analysing MathConverse, we derive a taxonomy of human behaviours and uncover that despite a generally positive correlation, there are notable instances of divergence between correctness and perceived helpfulness in LLM generations, amongst other findings. Further, we garner a more granular understanding of GPT-4 mathematical problem-solving through a series of case studies, contributed by expert mathematicians. We conclude with actionable takeaways for ML practitioners and mathematicians: models that communicate uncertainty, respond well to user corrections, and are more interpretable and concise may constitute better assistants. Interactive evaluation is a promising way to navigate the capability of these models; humans should be aware of language models' algebraic fallibility and discern where they are appropriate to use.
    Rethinking the Power of Graph Canonization in Graph Representation Learning with Stability. (arXiv:2309.00738v2 [cs.LG] UPDATED)
    The expressivity of Graph Neural Networks (GNNs) has been studied broadly in recent years to reveal the design principles for more powerful GNNs. Graph canonization is known as a typical approach to distinguish non-isomorphic graphs, yet it is rarely adopted when developing expressive GNNs. This paper proposes to maximize the expressivity of GNNs by graph canonization; the power of such GNNs is then studied from the perspective of model stability. A stable GNN will map similar graphs to close graph representations in the vectorial space, and the stability of GNNs is critical to generalize their performance to unseen graphs. We theoretically reveal the trade-off of expressivity and stability in graph-canonization-enhanced GNNs. Then we introduce a notion of universal graph canonization as the general solution to address the trade-off and characterize a widely applicable sufficient condition to solve the universal graph canonization. A comprehensive set of experiments demonstrates the effectiveness of the proposed method. In many popular graph benchmark datasets, graph canonization successfully enhances GNNs and provides highly competitive performance, indicating the capability and great potential of the proposed method in general graph representation learning. In graph datasets where the sufficient condition holds, GNNs enhanced by universal graph canonization consistently outperform GNN baselines and successfully improve the SOTA performance up to $31\%$, providing the optimal solution to numerous challenging real-world graph analytical tasks like gene network representation learning in bioinformatics.
    VQ-NeRF: Neural Reflectance Decomposition and Editing with Vector Quantization. (arXiv:2310.11864v2 [cs.CV] UPDATED)
    We propose VQ-NeRF, a two-branch neural network model that incorporates Vector Quantization (VQ) to decompose and edit reflectance fields in 3D scenes. Conventional neural reflectance fields use only continuous representations to model 3D scenes, despite the fact that objects are typically composed of discrete materials in reality. This lack of discretization can result in noisy material decomposition and complicated material editing. To address these limitations, our model consists of a continuous branch and a discrete branch. The continuous branch follows the conventional pipeline to predict decomposed materials, while the discrete branch uses the VQ mechanism to quantize continuous materials into individual ones. By discretizing the materials, our model can reduce noise in the decomposition process and generate a segmentation map of discrete materials. Specific materials can be easily selected for further editing by clicking on the corresponding area of the segmentation outcomes. Additionally, we propose a dropout-based VQ codeword ranking strategy to predict the number of materials in a scene, which reduces redundancy in the material segmentation process. To improve usability, we also develop an interactive interface to further assist material editing. We evaluate our model on both computer-generated and real-world scenes, demonstrating its superior performance. To the best of our knowledge, our model is the first to enable discrete material editing in 3D scenes.
    Equal Opportunity of Coverage in Fair Regression. (arXiv:2311.02243v1 [cs.LG])
    We study fair machine learning (ML) under predictive uncertainty to enable reliable and trustworthy decision-making. The seminal work of "equalized coverage" proposed an uncertainty-aware fairness notion. However, it does not guarantee equal coverage rates across more fine-grained groups (e.g., low-income females) conditioning on the true label and is biased in the assessment of uncertainty. To tackle these limitations, we propose a new uncertainty-aware fairness notion -- Equal Opportunity of Coverage (EOC) -- that aims to achieve two properties: (1) coverage rates for different groups with similar outcomes are close, and (2) the coverage rate for the entire population remains at a predetermined level. Further, the prediction intervals should be narrow to be informative. We propose Binned Fair Quantile Regression (BFQR), a distribution-free post-processing method to improve EOC with reasonable width for any trained ML model. It first calibrates a hold-out set to bound deviation from EOC, then leverages conformal prediction to maintain EOC on a test set while optimizing prediction interval width. Experimental results demonstrate the effectiveness of our method in improving EOC. Our code is publicly available at https://github.com/fangxin-wang/bfqr .
    Differentiable Clustering with Perturbed Spanning Forests. (arXiv:2305.16358v3 [cs.LG] UPDATED)
    We introduce a differentiable clustering method based on stochastic perturbations of minimum-weight spanning forests. This allows us to include clustering in end-to-end trainable pipelines, with efficient gradients. We show that our method performs well even in difficult settings, such as data sets with high noise and challenging geometries. We also formulate an ad hoc loss to efficiently learn from partial clustering data using this operation. We demonstrate its performance on several data sets for supervised and semi-supervised tasks.
    Practical Equivariances via Relational Conditional Neural Processes. (arXiv:2306.10915v2 [stat.ML] UPDATED)
    Conditional Neural Processes (CNPs) are a class of metalearning models popular for combining the runtime efficiency of amortized inference with reliable uncertainty quantification. Many relevant machine learning tasks, such as in spatio-temporal modeling, Bayesian Optimization and continuous control, inherently contain equivariances -- for example to translation -- which the model can exploit for maximal performance. However, prior attempts to include equivariances in CNPs do not scale effectively beyond two input dimensions. In this work, we propose Relational Conditional Neural Processes (RCNPs), an effective approach to incorporate equivariances into any neural process model. Our proposed method extends the applicability and impact of equivariant neural processes to higher dimensions. We empirically demonstrate the competitive performance of RCNPs on a large array of tasks naturally containing equivariances.
    Complex Facial Expression Recognition Using Deep Knowledge Distillation of Basic Features. (arXiv:2308.06197v2 [cs.CV] UPDATED)
    Complex emotion recognition is a cognitive task that has so far not reached the excellent performance achieved on other tasks that are at or above the level of human cognition. Emotion recognition through facial expressions is particularly difficult due to the complexity of emotions expressed by the human face. For a machine to approach the same level of performance in complex facial expression recognition as a human, it may need to synthesise knowledge and understand new concepts in real-time, as humans do. Humans are able to learn new concepts using only a few examples by distilling important information from memories. Inspired by human cognition and learning, we propose a novel continual learning method for complex facial expression recognition that can accurately recognise new compound expression classes using few training samples, by building on and retaining its knowledge of basic expression classes. In this work, we also use GradCAM visualisations to demonstrate the relationship between basic and compound facial expressions. Our method leverages this relationship through knowledge distillation and a novel Predictive Sorting Memory Replay, to achieve the current state-of-the-art in continual learning for complex facial expression recognition, with 74.28% Overall Accuracy on new classes. We also demonstrate that using continual learning for complex facial expression recognition achieves far better performance than non-continual learning methods, improving on state-of-the-art non-continual learning methods by 13.95%. Our work is also the first to apply few-shot learning to complex facial expression recognition, achieving the state-of-the-art with 100% accuracy using only a single training sample per class.
    PolyLUT: Learning Piecewise Polynomials for Ultra-Low Latency FPGA LUT-based Inference. (arXiv:2309.02334v2 [cs.LG] UPDATED)
    Field-programmable gate arrays (FPGAs) are widely used to implement deep learning inference. Standard deep neural network inference involves the computation of interleaved linear maps and nonlinear activation functions. Prior work for ultra-low latency implementations has hardcoded the combination of linear maps and nonlinear activations inside FPGA lookup tables (LUTs). Our work is motivated by the idea that the LUTs in an FPGA can be used to implement a much greater variety of functions than this. In this paper, we propose a novel approach to training neural networks for FPGA deployment using multivariate polynomials as the basic building block. Our method takes advantage of the flexibility offered by the soft logic, hiding the polynomial evaluation inside the LUTs with minimal overhead. We show that by using polynomial building blocks, we can achieve the same accuracy using considerably fewer layers of soft logic than by using linear functions, leading to significant latency and area improvements. We demonstrate the effectiveness of this approach in three tasks: network intrusion detection, jet identification at the CERN Large Hadron Collider, and handwritten digit recognition using the MNIST dataset.
    Self-Supervised Learning of Representations for Space Generates Multi-Modular Grid Cells. (arXiv:2311.02316v1 [cs.LG])
    To solve the spatial problems of mapping, localization and navigation, the mammalian lineage has developed striking spatial representations. One important spatial representation is the Nobel-prize winning grid cells: neurons that represent self-location, a local and aperiodic quantity, with seemingly bizarre non-local and spatially periodic activity patterns of a few discrete periods. Why has the mammalian lineage learnt this peculiar grid representation? Mathematical analysis suggests that this multi-periodic representation has excellent properties as an algebraic code with high capacity and intrinsic error-correction, but to date, there is no satisfactory synthesis of core principles that lead to multi-modular grid cells in deep recurrent neural networks. In this work, we begin by identifying key insights from four families of approaches to answering the grid cell question: coding theory, dynamical systems, function optimization and supervised deep learning. We then leverage our insights to propose a new approach that combines the strengths of all four approaches. Our approach is a self-supervised learning (SSL) framework - including data, data augmentations, loss functions and a network architecture - motivated from a normative perspective, without access to supervised position information or engineering of particular readout representations as needed in previous approaches. We show that multiple grid cell modules can emerge in networks trained on our SSL framework and that the networks and emergent representations generalize well outside their training distribution. This work contains insights for neuroscientists interested in the origins of grid cells as well as machine learning researchers interested in novel SSL frameworks.
    The Power of Preconditioning in Overparameterized Low-Rank Matrix Sensing. (arXiv:2302.01186v3 [cs.LG] UPDATED)
    We propose $\textsf{ScaledGD($\lambda$)}$, a preconditioned gradient descent method to tackle the low-rank matrix sensing problem when the true rank is unknown, and when the matrix is possibly ill-conditioned. Using overparameterized factor representations, $\textsf{ScaledGD($\lambda$)}$ starts from a small random initialization, and proceeds by gradient descent with a specific form of damped preconditioning to combat bad curvatures induced by overparameterization and ill-conditioning. At the expense of light computational overhead incurred by preconditioners, $\textsf{ScaledGD($\lambda$)}$ is remarkably robust to ill-conditioning compared to vanilla gradient descent ($\textsf{GD}$) even with overparameterization. Specifically, we show that, under the Gaussian design, $\textsf{ScaledGD($\lambda$)}$ converges to the true low-rank matrix at a constant linear rate after a small number of iterations that scales only logarithmically with respect to the condition number and the problem dimension. This significantly improves over the convergence rate of vanilla $\textsf{GD}$, which suffers from a polynomial dependency on the condition number. Our work provides evidence on the power of preconditioning in accelerating the convergence without hurting generalization in overparameterized learning.
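    For orientation, a damped preconditioned step of the kind described above can be sketched as follows. The notation is an assumption and does not come from the abstract: $\mathcal{A}$ is the linear measurement operator with adjoint $\mathcal{A}^{*}$, $y$ the observations, $X_t$ the overparameterized factor in a symmetric factorization $X_t X_t^{\top}$, $\eta$ the step size, and $\lambda > 0$ the damping parameter. The paper should be consulted for the exact form and constants.

```latex
% Sketch only: one damped, preconditioned gradient step (assumed notation).
\[
X_{t+1} \;=\; X_t \;-\; \eta \,
\mathcal{A}^{*}\!\bigl(\mathcal{A}(X_t X_t^{\top}) - y\bigr)\, X_t \,
\bigl(X_t^{\top} X_t + \lambda I\bigr)^{-1}
\]
```

    Replacing the factor $(X_t^{\top} X_t + \lambda I)^{-1}$ by the identity gives the vanilla $\textsf{GD}$ step on the same objective (up to a constant absorbed into $\eta$), and the damping term $\lambda I$ keeps the inverse well defined when the overparameterized $X_t$ is rank deficient.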
    FedZero: Leveraging Renewable Excess Energy in Federated Learning. (arXiv:2305.15092v2 [cs.LG] UPDATED)
    Federated Learning (FL) is an emerging machine learning technique that enables distributed model training across data silos or edge devices without data sharing. Yet, FL inevitably introduces inefficiencies compared to centralized model training, which will further increase the already high energy usage and associated carbon emissions of machine learning in the future. One idea to reduce FL's carbon footprint is to schedule training jobs based on the availability of renewable excess energy that can occur at certain times and places in the grid. However, in the presence of such volatile and unreliable resources, existing FL schedulers cannot always ensure fast, efficient, and fair trainings. We propose FedZero, an FL system that operates exclusively on renewable excess energy and spare capacity of compute infrastructure to effectively reduce a training's operational carbon emissions to zero. Using energy and load forecasts, FedZero leverages the spatio-temporal availability of excess resources by selecting clients for fast convergence and fair participation. Our evaluation, based on real solar and load traces, shows that FedZero converges significantly faster than existing approaches under the mentioned constraints while consuming less energy. Furthermore, it is robust to forecasting errors and scalable to tens of thousands of clients.
    Sparse Training of Discrete Diffusion Models for Graph Generation. (arXiv:2311.02142v1 [cs.LG])
    Generative models for graphs often encounter scalability challenges due to the inherent need to predict interactions for every node pair. Despite the sparsity often exhibited by real-world graphs, the unpredictable sparsity patterns of their adjacency matrices, stemming from their unordered nature, lead to quadratic computational complexity. In this work, we introduce SparseDiff, a denoising diffusion model for graph generation that is able to exploit sparsity during its training phase. At the core of SparseDiff is a message-passing neural network tailored to predict only a subset of edges during each forward pass. When combined with a sparsity-preserving noise model, this model can efficiently work with edge-list representations of graphs, paving the way for scalability to much larger structures. During the sampling phase, SparseDiff iteratively populates the adjacency matrix from its prior state, ensuring prediction of the full graph while controlling memory utilization. Experimental results show that SparseDiff simultaneously matches the state of the art in generation performance on both small and large graphs, highlighting the versatility of our method.
    Multi-task Learning for Optical Coherence Tomography Angiography (OCTA) Vessel Segmentation. (arXiv:2311.02266v1 [eess.IV])
    Optical Coherence Tomography Angiography (OCTA) is a non-invasive imaging technique that provides high-resolution cross-sectional images of the retina, which are useful for diagnosing and monitoring various retinal diseases. However, manual segmentation of OCTA images is a time-consuming and labor-intensive task, which motivates the development of automated segmentation methods. In this paper, we propose a novel multi-task learning method for OCTA segmentation, called OCTA-MTL, that leverages an image-to-DT (Distance Transform) branch and an adaptive loss combination strategy. The image-to-DT branch predicts the distance from each vessel voxel to the vessel surface, which can provide useful shape prior and boundary information for the segmentation task. The adaptive loss combination strategy dynamically adjusts the loss weights according to the inverse of the average loss values of each task, to balance the learning process and avoid the dominance of one task over the other. We evaluate our method on the ROSE-2 dataset and demonstrate its superiority in terms of segmentation performance against two baseline methods: a single-task segmentation method and a multi-task segmentation method with a fixed loss combination.
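    The adaptive loss combination described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `eps` guard, and the choice to normalize the weights so they sum to one are all assumptions made for illustration. It only shows the stated idea of weighting each task by the inverse of its average loss.

```python
def adaptive_loss_weights(avg_losses, eps=1e-8):
    """Return one weight per task, proportional to 1 / (average loss).

    `avg_losses` holds a running average of each task's loss (hypothetical
    input). Normalizing the weights to sum to 1 is an assumption, not
    something stated in the abstract.
    """
    inverse = [1.0 / (loss + eps) for loss in avg_losses]  # eps avoids division by zero
    total = sum(inverse)
    return [value / total for value in inverse]


def combined_loss(current_losses, avg_losses):
    """Weighted sum of the current task losses using the adaptive weights."""
    weights = adaptive_loss_weights(avg_losses)
    return sum(w * loss for w, loss in zip(weights, current_losses))


# Example: the segmentation task averages a loss of 1.0 and the
# distance-transform task 3.0, so the second task is down-weighted.
weights = adaptive_loss_weights([1.0, 3.0])
total = combined_loss([2.0, 4.0], [1.0, 3.0])
```

    Under this normalization a task whose average loss is three times larger receives one third of the weight, which is one way to keep either task from dominating the combined objective.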
    Gacs-Korner Common Information Variational Autoencoder. (arXiv:2205.12239v2 [cs.LG] UPDATED)
    We propose a notion of common information that allows one to quantify and separate the information that is shared between two random variables from the information that is unique to each. Our notion of common information is defined by an optimization problem over a family of functions and recovers the Gács-Körner common information as a special case. Importantly, our notion can be approximated empirically using samples from the underlying data distribution. We then provide a method to partition and quantify the common and unique information using a simple modification of a traditional variational auto-encoder. Empirically, we demonstrate that our formulation allows us to learn semantically meaningful common and unique factors of variation even on high-dimensional data such as images and videos. Moreover, on datasets where ground-truth latent factors are known, we show that we can accurately quantify the common information between the random variables.
    A Theory of Unsupervised Translation Motivated by Understanding Animal Communication. (arXiv:2211.11081v2 [cs.CL] UPDATED)
    Neural networks are capable of translating between languages -- in some cases even between two languages where there is little or no access to parallel translations, in what is known as Unsupervised Machine Translation (UMT). Given this progress, it is intriguing to ask whether machine learning tools can ultimately enable understanding animal communication, particularly that of highly intelligent animals. We propose a theoretical framework for analyzing UMT when no parallel translations are available and when it cannot be assumed that the source and target corpora address related subject domains or possess similar linguistic structure. We exemplify this theory with two stylized models of language, for which our framework provides bounds on necessary sample complexity; the bounds are formally proven and experimentally verified on synthetic data. These bounds show that the error rates are inversely related to the language complexity and amount of common ground. This suggests that unsupervised translation of animal communication may be feasible if the communication system is sufficiently complex.
    Unified Out-Of-Distribution Detection: A Model-Specific Perspective. (arXiv:2304.06813v2 [cs.LG] UPDATED)
    Out-of-distribution (OOD) detection aims to identify test examples that do not belong to the training distribution and are thus unlikely to be predicted reliably. Despite a plethora of existing works, most focus only on the scenario where OOD examples come from semantic shift (e.g., unseen categories), ignoring other possible causes (e.g., covariate shift). In this paper, we present a novel, unifying framework to study OOD detection in a broader scope. Instead of detecting OOD examples from a particular cause, we propose to detect examples that a deployed machine learning model (e.g., an image classifier) is unable to predict correctly. That is, whether a test example should be detected and rejected or not is "model-specific". We show that this framework unifies the detection of OOD examples caused by semantic shift and covariate shift, and closely addresses the concern of applying a machine learning model to uncontrolled environments. We provide an extensive analysis that involves a variety of models (e.g., different architectures and training strategies), sources of OOD examples, and OOD detection approaches, and reveal several insights into improving and understanding OOD detection in uncontrolled environments.
    Harmonic Self-Conditioned Flow Matching for Multi-Ligand Docking and Binding Site Design. (arXiv:2310.05764v2 [cs.LG] UPDATED)
    A significant amount of protein function requires binding small molecules, including enzymatic catalysis. As such, designing binding pockets for small molecules has several impactful applications ranging from drug synthesis to energy storage. Towards this goal, we first develop HarmonicFlow, an improved generative process over 3D protein-ligand binding structures based on our self-conditioned flow matching objective. FlowSite extends this flow model to jointly generate a protein pocket's discrete residue types and the molecule's binding 3D structure. We show that HarmonicFlow improves upon state-of-the-art generative processes for docking in simplicity, generality, and average sample quality in pocket-level docking. Enabled by this structure modeling, FlowSite designs binding sites substantially better than baseline approaches.
    Monte Carlo is a good sampling strategy for polynomial approximation in high dimensions. (arXiv:2208.09045v3 [math.NA] UPDATED)
    This paper concerns the approximation of smooth, high-dimensional functions from limited samples using polynomials. This task lies at the heart of many applications in computational science and engineering - notably, some of those arising from parametric modelling and computational uncertainty quantification. It is common to use Monte Carlo sampling in such applications, so as not to succumb to the curse of dimensionality. However, it is well known that such a strategy is theoretically suboptimal. Specifically, there are many polynomial spaces of dimension $n$ for which the sample complexity scales log-quadratically, i.e., like $c \cdot n^2 \cdot \log(n)$ as $n \rightarrow \infty$. This well-documented phenomenon has led to a concerted effort over the last decade to design improved, and moreover, near-optimal strategies, whose sample complexities scale log-linearly, or even linearly in $n$. In this work we demonstrate that Monte Carlo is actually a perfectly good strategy in high dimensions, despite its apparent suboptimality. We first document this phenomenon empirically via a systematic set of numerical experiments. Next, we present a theoretical analysis that rigorously justifies this fact in the case of holomorphic functions of infinitely-many variables. We show that there is a least-squares approximation based on $m$ Monte Carlo samples whose error decays algebraically fast in $m/\log(m)$, with a rate that is the same as that of the best $n$-term polynomial approximation. This result is non-constructive, since it assumes knowledge of a suitable polynomial subspace in which to perform the approximation. We next present a compressed sensing-based scheme that achieves the same rate, except for a larger polylogarithmic factor. This scheme is practical, and numerically it performs as well as or better than well-known adaptive least-squares schemes.
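    As a toy illustration of least-squares polynomial approximation from Monte Carlo samples, the sketch below fits a Legendre expansion to a smooth function from uniformly random points. It is a one-dimensional simplification under stated assumptions, not the paper's high-dimensional scheme: the target function, the degree `n=6`, the sample count `m=200`, and all helper names are hypothetical choices made here.

```python
import math
import random

def legendre_basis(x, n):
    """Values of the Legendre polynomials P_0..P_{n-1} at x in [-1, 1]."""
    values = [1.0, x][:n]
    for k in range(1, n - 1):
        # Bonnet recurrence: (k+1) P_{k+1} = (2k+1) x P_k - k P_{k-1}
        values.append(((2 * k + 1) * x * values[k] - k * values[k - 1]) / (k + 1))
    return values

def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def fit_least_squares(f, n, m, rng):
    """Least-squares fit in span{P_0..P_{n-1}} from m Monte Carlo samples."""
    samples = [rng.uniform(-1.0, 1.0) for _ in range(m)]
    rows = [legendre_basis(x, n) for x in samples]
    # Normal equations: (A^T A) c = A^T y
    gram = [[sum(r[i] * r[j] for r in rows) for j in range(n)] for i in range(n)]
    rhs = [sum(r[i] * f(x) for r, x in zip(rows, samples)) for i in range(n)]
    return solve_linear(gram, rhs)

def evaluate(coeffs, x):
    return sum(c * p for c, p in zip(coeffs, legendre_basis(x, len(coeffs))))

rng = random.Random(0)          # fixed seed so the run is repeatable
target = math.exp               # hypothetical smooth target function
coeffs = fit_least_squares(target, n=6, m=200, rng=rng)
max_error = max(abs(evaluate(coeffs, t / 100.0) - target(t / 100.0))
                for t in range(-100, 101))
print(f"max error on a grid: {max_error:.2e}")
```

    Here the number of samples is well above the dimension of the polynomial space, which is the regime in which plain Monte Carlo sampling is expected to behave well; the sketch does not reproduce the paper's sample-complexity analysis.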
    Learning a Consensus Sub-Network with Polarization Regularization and One Pass Training. (arXiv:2302.10798v4 [cs.LG] UPDATED)
    The subject of green AI has been gaining attention within the deep learning community given the recent trend of ever larger and more complex neural network models. Existing solutions for reducing the computational load of training at inference time usually involve pruning the network parameters. Pruning schemes often create extra overhead either by iterative training and fine-tuning for static pruning or repeated computation of a dynamic pruning graph. We propose a new parameter pruning strategy for learning a lighter-weight sub-network that minimizes the energy cost while maintaining comparable performance to the fully parameterised network on given downstream tasks. Our proposed pruning scheme is green-oriented, as it only requires a one-off training to discover the optimal static sub-networks by dynamic pruning methods. The pruning scheme consists of a binary gating module and a novel loss function to uncover sub-networks with user-defined sparsity. Our method enables pruning and training simultaneously, which saves energy in both the training and inference phases and avoids extra computational overhead from gating modules at inference time. Our results on CIFAR-10 and CIFAR-100 suggest that our scheme can remove 50% of connections in deep networks with less than 1% reduction in classification accuracy. Compared to other related pruning methods, our method demonstrates a lower drop in accuracy for equivalent reductions in computational cost.
    Interpretability is not Explainability: New Quantitative XAI Approach with a focus on Recommender Systems in Education. (arXiv:2311.02078v1 [cs.IR])
    The field of eXplainable Artificial Intelligence faces challenges due to the absence of a widely accepted taxonomy that facilitates the quantitative evaluation of explainability in Machine Learning algorithms. In this paper, we propose a novel taxonomy that addresses the current gap in the literature by providing a clear and unambiguous understanding of the key concepts and relationships in XAI. Our approach is rooted in a systematic analysis of existing definitions and frameworks, with a focus on transparency, interpretability, completeness, complexity and understandability as essential dimensions of explainability. This comprehensive taxonomy aims to establish a shared vocabulary for future research. To demonstrate the utility of our proposed taxonomy, we examine a case study of a Recommender System designed to curate and recommend the most suitable online resources from MERLOT. By employing the SHAP package, we quantify and enhance the explainability of the RS within the context of our newly developed taxonomy.
    Performance evaluation of deep segmentation models for Contrails detection. (arXiv:2211.14851v4 [cs.CV] UPDATED)
    Contrails, short for condensation trails, are line-shaped ice clouds produced by aircraft engine exhaust when aircraft fly through cold and humid air. They generate a greenhouse effect by absorbing or directing back to Earth approximately 33% of emitted outgoing longwave radiation. They account for over half of the climate change resulting from aviation activities. Avoiding contrails and adjusting flight routes could be an inexpensive and effective way to reduce their impact. An accurate, automated, and reliable detection algorithm is required to develop and evaluate contrail avoidance strategies. Advancement in contrail detection has been severely limited due to several factors, primarily a lack of quality-labeled data. Recently, a large human-labeled Landsat-8 contrails dataset was proposed. Each contrail is carefully labeled with various inputs in various scenes of Landsat-8 satellite imagery. In this work, we benchmark several popular segmentation models with combinations of different loss functions and encoder backbones. This work is the first to apply state-of-the-art segmentation techniques to detect contrails in low-orbit satellite imagery. Our work can also be used as an open benchmark for contrail segmentation and is publicly available.
    Making Harmful Behaviors Unlearnable for Large Language Models. (arXiv:2311.02105v1 [cs.LG])
    Large language models (LLMs) have shown great potential as general-purpose AI assistants in various domains. To meet the requirements of different applications, LLMs are often customized by further fine-tuning. However, the powerful learning ability of LLMs not only enables them to acquire new tasks but also makes them susceptible to learning undesired behaviors. For example, even safety-aligned LLMs can be easily fine-tuned into harmful assistants as the fine-tuning data often contains implicit or explicit harmful content. Can we train LLMs on harmful data without learning harmful behaviors? This paper proposes a controllable training framework that makes harmful behaviors unlearnable during the fine-tuning process. Specifically, we introduce "security vectors", a few new parameters that can be separated from the LLM, to ensure the LLM's responses are consistent with the harmful behavior. Security vectors are activated during fine-tuning; the consistent behavior makes the LLM believe that such behavior has already been learned, so there is no need to further optimize for harmful data. During inference, we can deactivate security vectors to restore the LLM's normal behavior. The experimental results show that the security vectors generated by 100 harmful samples are enough to prevent the LLM from learning 1000 harmful samples, while preserving the ability to learn other useful information.
    The Alignment Problem in Context. (arXiv:2311.02147v1 [cs.LG])
    A core challenge in the development of increasingly capable AI systems is to make them safe and reliable by ensuring their behaviour is consistent with human values. This challenge, known as the alignment problem, does not merely apply to hypothetical future AI systems that may pose catastrophic risks; it already applies to current systems, such as large language models, whose potential for harm is rapidly increasing. In this paper, I assess whether we are on track to solve the alignment problem for large language models, and what that means for the safety of future AI systems. I argue that existing strategies for alignment are insufficient, because large language models remain vulnerable to adversarial attacks that can reliably elicit unsafe behaviour. I offer an explanation of this lingering vulnerability on which it is not simply a contingent limitation of current language models, but has deep technical ties to a crucial aspect of what makes these models useful and versatile in the first place -- namely, their remarkable aptitude to learn "in context" directly from user instructions. It follows that the alignment problem is not only unsolved for current AI systems, but may be intrinsically difficult to solve without severely undermining their capabilities. Furthermore, this assessment raises concerns about the prospect of ensuring the safety of future and more capable AI systems.
    Distributed Machine Learning in D2D-Enabled Heterogeneous Networks: Architectures, Performance, and Open Challenges. (arXiv:2206.01906v2 [cs.LG] UPDATED)
    The ever-growing concerns regarding data privacy have led to a paradigm shift in machine learning (ML) architectures from centralized to distributed approaches, giving rise to federated learning (FL) and split learning (SL) as the two predominant privacy-preserving ML mechanisms. However, implementing FL or SL in device-to-device (D2D)-enabled heterogeneous networks with diverse clients presents substantial challenges, including architecture scalability and prolonged training delays. To address these challenges, this article introduces two innovative hybrid distributed ML architectures, namely, hybrid split FL (HSFL) and hybrid federated SL (HFSL). Such architectures combine the strengths of both FL and SL in D2D-enabled heterogeneous wireless networks. We provide a comprehensive analysis of the performance and advantages of HSFL and HFSL, while also highlighting open challenges for future exploration. We support our proposals with preliminary simulations using three datasets in non-independent and non-identically distributed settings, demonstrating the feasibility of our architectures. Our simulations reveal notable reductions in communication/computation costs and training delays as compared to conventional FL and SL.
    Towards model-free RL algorithms that scale well with unstructured data. (arXiv:2311.02215v1 [cs.LG])
    Conventional reinforcement learning (RL) algorithms exhibit broad generality in their theoretical formulation and high performance on several challenging domains when combined with powerful function approximation. However, developing RL algorithms that perform well across problems with unstructured observations at scale remains challenging because most function approximation methods rely on externally provisioned knowledge about the structure of the input for good performance (e.g. convolutional networks, graph neural networks, tile-coding). A common practice in RL is to evaluate algorithms on a single problem, or on problems with limited variation in the observation scale. RL practitioners lack a systematic way to study how well a single RL algorithm performs when instantiated across a range of problem scales, and they lack function approximation techniques that scale well with unstructured observations. We address these limitations by providing environments and algorithms to study scaling for unstructured observation vectors and flat action spaces. We introduce a family of combinatorial RL problems with an exponentially large state space and high-dimensional dynamics but where linear computation is sufficient to learn a (nonlinear) value function estimate for performant control. We provide an algorithm that constructs reward-relevant general value function (GVF) questions to find and exploit predictive structure directly from the experience stream. In an empirical evaluation of the approach on synthetic problems, we observe a sample complexity that scales linearly with the observation size. The proposed algorithm reliably outperforms a conventional deep RL algorithm on these scaling problems, and it exhibits several desirable auxiliary properties. These results suggest new algorithmic mechanisms by which algorithms can learn at scale from unstructured data.
    Machine learning the interaction network in coupled dynamical systems. (arXiv:2310.03378v2 [math.DS] UPDATED)
    The study of interacting dynamical systems continues to attract research interest in various fields of science and engineering. In a collection of interacting particles, the interaction network contains information about how various components interact with one another. Inferring the information about the interaction network from the dynamics of agents is a problem of long-standing interest. In this work, we employ a self-supervised neural network model to achieve two outcomes: to recover the interaction network and to predict the dynamics of individual agents. Both pieces of information are inferred solely from the observed trajectory data. This work presents an application of the Neural Relational Inference model to two dynamical systems: coupled particles mediated by Hooke's law interaction and coupled phase (Kuramoto) oscillators.
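    For readers unfamiliar with the second system, the sketch below generates trajectory data from coupled Kuramoto phase oscillators, the kind of observations from which an interaction network would be inferred. It only simulates the dynamics and does not implement Neural Relational Inference; the adjacency matrix, coupling strength, step size, and function names are hypothetical choices made for illustration.

```python
import cmath
import math
import random

def simulate_kuramoto(adjacency, omegas, theta0, coupling, dt, steps):
    """Forward-Euler integration of
    d(theta_i)/dt = omega_i + (K / N) * sum_j A_ij * sin(theta_j - theta_i)."""
    n = len(theta0)
    theta = list(theta0)
    trajectory = [list(theta)]
    for _ in range(steps):
        dtheta = [
            omegas[i] + (coupling / n) * sum(
                adjacency[i][j] * math.sin(theta[j] - theta[i]) for j in range(n)
            )
            for i in range(n)
        ]
        theta = [theta[i] + dt * dtheta[i] for i in range(n)]
        trajectory.append(list(theta))
    return trajectory

def order_parameter(theta):
    """Magnitude of the mean phase vector; 1.0 means full synchrony."""
    return abs(sum(cmath.exp(1j * t) for t in theta) / len(theta))

rng = random.Random(0)
n = 5
# Hypothetical interaction network: all-to-all coupling, no self-loops.
adjacency = [[0 if i == j else 1 for j in range(n)] for i in range(n)]
omegas = [0.0] * n                                   # identical natural frequencies
theta0 = [rng.uniform(-1.0, 1.0) for _ in range(n)]  # initial phases in radians
trajectory = simulate_kuramoto(adjacency, omegas, theta0,
                               coupling=2.0, dt=0.01, steps=2000)
r_initial = order_parameter(trajectory[0])
r_final = order_parameter(trajectory[-1])
print(f"order parameter: {r_initial:.3f} -> {r_final:.3f}")
```

    With identical frequencies and all-to-all coupling the phases synchronize, so the order parameter rises toward 1. A network-inference model would be given only `trajectory` and asked to recover `adjacency`.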
    A Taxonomy of Human and ML Strengths in Decision-Making to Investigate Human-ML Complementarity. (arXiv:2204.10806v3 [cs.HC] UPDATED)
    Hybrid human-ML systems increasingly make consequential decisions in a wide range of domains. These systems are often introduced with the expectation that the combined human-ML system will achieve complementary performance, that is, the combined decision-making system will be an improvement compared with either decision-making agent in isolation. However, empirical results have been mixed, and existing research rarely articulates the sources and mechanisms by which complementary performance is expected to arise. Our goal in this work is to provide conceptual tools to advance the way researchers reason and communicate about human-ML complementarity. Drawing upon prior literature in human psychology, machine learning, and human-computer interaction, we propose a taxonomy characterizing distinct ways in which human and ML-based decision-making can differ. In doing so, we conceptually map potential mechanisms by which combining human and ML decision-making may yield complementary performance, developing a language for the research community to reason about design of hybrid systems in any decision-making domain. To illustrate how our taxonomy can be used to investigate complementarity, we provide a mathematical aggregation framework to examine enabling conditions for complementarity. Through synthetic simulations, we demonstrate how this framework can be used to explore specific aspects of our taxonomy and shed light on the optimal mechanisms for combining human-ML judgments.
    Discrete neural nets and polymorphic learning. (arXiv:2308.00677v2 [cs.NE] UPDATED)
    Theorems from universal algebra such as that of Murskiĭ from the 1970s have a striking similarity to universal approximation results for neural nets along the lines of Cybenko's from the 1980s. We consider here a discrete analogue of the classical notion of a neural net which places these results in a unified setting. We introduce a learning algorithm based on polymorphisms of relational structures and show how to use it for a classical learning task.
    Domain Watermark: Effective and Harmless Dataset Copyright Protection is Closed at Hand. (arXiv:2310.14942v2 [cs.CV] UPDATED)
    The prosperity of deep neural networks (DNNs) is largely benefited from open-source datasets, based on which users can evaluate and improve their methods. In this paper, we revisit backdoor-based dataset ownership verification (DOV), which is currently the only feasible approach to protect the copyright of open-source datasets. We reveal that these methods are fundamentally harmful given that they could introduce malicious misclassification behaviors to watermarked DNNs by the adversaries. In this paper, we design DOV from another perspective by making watermarked models (trained on the protected dataset) correctly classify some 'hard' samples that will be misclassified by the benign model. Our method is inspired by the generalization property of DNNs, where we find a hardly-generalized domain for the original dataset (as its domain watermark). It can be easily learned with the protected dataset containing modified samples. Specifically, we formulate the domain generation as a bi-level optimization and propose to optimize a set of visually-indistinguishable clean-label modified data with similar effects to domain-watermarked samples from the hardly-generalized domain to ensure watermark stealthiness. We also design a hypothesis-test-guided ownership verification via our domain watermark and provide the theoretical analyses of our method. Extensive experiments on three benchmark datasets are conducted, which verify the effectiveness of our method and its resistance to potential adaptive methods. The code for reproducing main experiments is available at https://github.com/JunfengGo/Domain-Watermark.
    Benefits of mirror weight symmetry for 3D mesh segmentation in biomedical applications. (arXiv:2309.17076v2 [eess.IV] UPDATED)
    3D mesh segmentation is an important task with many biomedical applications. The human body has bilateral symmetry and some variations in organ positions. It allows us to expect a positive effect of rotation and inversion invariant layers in convolutional neural networks that perform biomedical segmentations. In this study, we show the impact of weight symmetry in neural networks that perform 3D mesh segmentation. We analyze the problem of 3D mesh segmentation for pathological vessel structures (aneurysms) and conventional anatomical structures (endocardium and epicardium of ventricles). Local geometrical features are encoded as sampling from the signed distance function, and the neural network performs prediction for each mesh node. We show that weight symmetry gains from 1 to 3% of additional accuracy and allows decreasing the number of trainable parameters up to 8 times without suffering the performance loss if neural networks have at least three convolutional layers. This also works for very small training sets.
    MANER: Multi-Agent Neural Rearrangement Planning of Objects in Cluttered Environments. (arXiv:2306.06543v2 [cs.RO] UPDATED)
    Object rearrangement is a fundamental problem in robotics with various practical applications ranging from managing warehouses to cleaning and organizing home kitchens. While existing research has primarily focused on single-agent solutions, real-world scenarios often require multiple robots to work together on rearrangement tasks. This paper proposes a comprehensive learning-based framework for multi-agent object rearrangement planning, addressing the challenges of task sequencing and path planning in complex environments. The proposed method iteratively selects objects, determines their relocation regions, and pairs them with available robots under kinematic feasibility and task reachability for execution to achieve the target arrangement. Our experiments on a diverse range of simulated and real-world environments demonstrate the effectiveness and robustness of the proposed framework. Furthermore, results indicate improved performance in terms of traversal time and success rate compared to baseline approaches.
    Fine-Tune Language Models as Differential Equation Solvers. (arXiv:2308.05061v2 [cs.LG] UPDATED)
    In the growing domain of scientific machine learning, in-context operator learning has shown notable potential in learning operators and solving differential equations using prompted data during the inference stage without weight updates. However, the current model's overdependence on function data may inadvertently overlook the invaluable human insight into the operator. To address this, we present a transformation of in-context operator learning into a multi-modal paradigm. In particular, we take inspiration from the recent success of large language models, and propose using "captions" to integrate human knowledge about the operator, expressed through natural language descriptions and equations. Also, we introduce a novel approach to train a language-model-like architecture, or directly fine-tune existing language models, for in-context operator learning. We beat the baseline on single-modal learning tasks, and also demonstrated the effectiveness of multi-modal learning in enhancing performance and reducing function data requirements. The proposed method not only significantly improves in-context operator learning, but also creates a new path for the application of language models.
    Sampling via Gradient Flows in the Space of Probability Measures. (arXiv:2310.03597v2 [stat.ML] UPDATED)
    Sampling a target probability distribution with an unknown normalization constant is a fundamental challenge in computational science and engineering. Recent work shows that algorithms derived by considering gradient flows in the space of probability measures open up new avenues for algorithm development. This paper makes three contributions to this sampling approach by scrutinizing the design components of such gradient flows. Any instantiation of a gradient flow for sampling needs an energy functional and a metric to determine the flow, as well as numerical approximations of the flow to derive algorithms. Our first contribution is to show that the Kullback-Leibler divergence, as an energy functional, has the unique property (among all f-divergences) that gradient flows resulting from it do not depend on the normalization constant of the target distribution. Our second contribution is to study the choice of metric from the perspective of invariance. The Fisher-Rao metric is known as the unique choice (up to scaling) that is diffeomorphism invariant. As a computationally tractable alternative, we introduce a relaxed, affine invariance property for the metrics and gradient flows. In particular, we construct various affine invariant Wasserstein and Stein gradient flows. Affine invariant gradient flows are shown to behave more favorably than their non-affine-invariant counterparts when sampling highly anisotropic distributions, in theory and by using particle methods. Our third contribution is to study, and develop efficient algorithms based on Gaussian approximations of the gradient flows; this leads to an alternative to particle methods. We establish connections between various Gaussian approximate gradient flows, discuss their relation to gradient methods arising from parametric variational inference, and study their convergence properties both theoretically and numerically.
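    The first contribution can be made concrete with a short calculation. The sketch below writes the target as $\pi = e^{-V}/Z$ and shows that $Z$ enters the Kullback-Leibler energy only as an additive constant, so it drops out of the Wasserstein gradient flow. The symbols $V$, $Z$, and $\rho_t$ are notation introduced here for illustration; the calculation covers only the Wasserstein case and does not show the uniqueness claim among f-divergences.

```latex
\begin{align*}
\mathrm{KL}(\rho \,\|\, \pi)
  &= \int \rho \log \frac{\rho}{\pi}\, dx
   = \int \rho \log \rho \, dx + \int \rho V \, dx + \log Z,
  \qquad \pi = \frac{e^{-V}}{Z}, \\
\frac{\delta\, \mathrm{KL}}{\delta \rho}
  &= \log \rho + V + 1 + \log Z, \\
\partial_t \rho_t
  &= \nabla \cdot \Big( \rho_t \, \nabla \frac{\delta\, \mathrm{KL}}{\delta \rho} \Big)
   = \Delta \rho_t + \nabla \cdot \big( \rho_t \nabla V \big).
\end{align*}
```

    The constants $1 + \log Z$ vanish under the gradient, leaving a Fokker-Planck equation that involves only $\nabla V$, which can be evaluated without knowing $Z$.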
    The Data Provenance Initiative: A Large Scale Audit of Dataset Licensing & Attribution in AI. (arXiv:2310.16787v3 [cs.CL] UPDATED)
    The race to train language models on vast, diverse, and inconsistently documented datasets has raised pressing concerns about the legal and ethical risks for practitioners. To remedy these practices threatening data transparency and understanding, we convene a multi-disciplinary effort between legal and machine learning experts to systematically audit and trace 1800+ text datasets. We develop tools and standards to trace the lineage of these datasets, from their source, creators, series of license conditions, properties, and subsequent use. Our landscape analysis highlights the sharp divides in composition and focus of commercially open vs closed datasets, with closed datasets monopolizing important categories: lower resource languages, more creative tasks, richer topic variety, newer and more synthetic training data. This points to a deepening divide in the types of data that are made available under different license conditions, and heightened implications for jurisdictional legal interpretations of copyright and fair use. We also observe frequent miscategorization of licenses on widely used dataset hosting sites, with license omission of 70%+ and error rates of 50%+. This points to a crisis in misattribution and informed use of the most popular datasets driving many recent breakthroughs. As a contribution to ongoing improvements in dataset transparency and responsible use, we release our entire audit, with an interactive UI, the Data Provenance Explorer, which allows practitioners to trace and filter on data provenance for the most popular open source finetuning data collections: www.dataprovenance.org.
    RADIO: Reference-Agnostic Dubbing Video Synthesis. (arXiv:2309.01950v2 [cs.CV] UPDATED)
    One of the most challenging problems in audio-driven talking head generation is achieving high-fidelity detail while ensuring precise synchronization. Given only a single reference image, extracting meaningful identity attributes becomes even more challenging, often causing the network to mirror the facial and lip structures too closely. To address these issues, we introduce RADIO, a framework engineered to yield high-quality dubbed videos regardless of the pose or expression in reference images. The key is to modulate the decoder layers using latent space composed of audio and reference features. Additionally, we incorporate ViT blocks into the decoder to emphasize high-fidelity details, especially in the lip region. Our experimental results demonstrate that RADIO displays high synchronization without the loss of fidelity. Especially in harsh scenarios where the reference frame deviates significantly from the ground truth, our method outperforms state-of-the-art methods, highlighting its robustness.
    Data Filtering Networks. (arXiv:2309.17425v3 [cs.AI] UPDATED)
    Large training sets have become a cornerstone of machine learning and are the foundation for recent advances in language modeling and multimodal learning. While data curation for pre-training is often still ad-hoc, one common paradigm is to first collect a massive pool of data from the Web and then filter this candidate pool down to an actual training set via various heuristics. In this work, we study the problem of learning a data filtering network (DFN) for this second step of filtering a large uncurated dataset. Our key finding is that the quality of a network for filtering is distinct from its performance on downstream tasks: for instance, a model that performs well on ImageNet can yield worse training sets than a model with low ImageNet accuracy that is trained on a small amount of high-quality data. Based on our insights, we construct new data filtering networks that induce state-of-the-art image-text datasets. Specifically, our best performing dataset DFN-5B enables us to train state-of-the-art CLIP models for their compute budgets: among other improvements on a variety of tasks, a ViT-H trained on our dataset achieves 84.4% zero-shot transfer accuracy on ImageNet, out-performing models trained on other datasets such as LAION-2B, DataComp-1B, or OpenAI's WIT. In order to facilitate further research in dataset design, we also release a new 2 billion example dataset DFN-2B and show that high performance data filtering networks can be trained from scratch using only publicly available data.
    Feature selection and regression methods for stock price prediction using technical indicators. (arXiv:2310.09903v4 [q-fin.ST] UPDATED)
    Because many factors, including technical indicators, influence stock price prediction, feature selection is important for choosing the best indicators. This study uses technical indicators together with feature selection and regression methods to predict the closing price of a stock. The aim of this research is to predict the stock market price with the least error. In the proposed method, the data created by a 3-day time window were converted to the appropriate input for regression methods. In this paper, 10 regressors and 123 technical indicators have been examined on data from the last 13 years of Apple Company. The results have been investigated by 5 error-based evaluation criteria. Based on the results of the proposed method, MLPSF performs 56.47% better than MLP. Also, SVRSF improved by 67.42% compared to SVR. LRSF improved by 76.7% compared to LR. The RISF method improved by 72.82% over Ridge regression. The DTRSB method had a 24.23% improvement over DTR. KNNSB had a 15.52% improvement over KNN regression. RFSB had a 6% improvement over RF. GBRSF improved by 7% over GBR. Finally, ADASF and ADASB had a 4% improvement over the ADA regression. Also, Ridge and LinearRegression had the best results for stock price prediction. Based on the results, the best indicators to predict stock price are: Squeeze_pro, Percentage Price Oscillator, Thermo, Decay, Archer On-Balance Volume, Bollinger Bands, Squeeze, and Ichimoku. According to the results, using a suitable combination of the suggested indicators along with regression methods has resulted in high accuracy in predicting the closing price.
    GInX-Eval: Towards In-Distribution Evaluation of Graph Neural Network Explanations. (arXiv:2309.16223v2 [cs.AI] UPDATED)
    Diverse explainability methods of graph neural networks (GNN) have recently been developed to highlight the edges and nodes in the graph that contribute the most to the model predictions. However, it is not clear yet how to evaluate the correctness of those explanations, whether it is from a human or a model perspective. One unaddressed bottleneck in the current evaluation procedure is the problem of out-of-distribution explanations, whose distribution differs from those of the training data. This important issue affects existing evaluation metrics such as the popular faithfulness or fidelity score. In this paper, we show the limitations of faithfulness metrics. We propose GInX-Eval (Graph In-distribution eXplanation Evaluation), an evaluation procedure of graph explanations that overcomes the pitfalls of faithfulness and offers new insights on explainability methods. Using a fine-tuning strategy, the GInX score measures how informative removed edges are for the model and the EdgeRank score evaluates if explanatory edges are correctly ordered by their importance. GInX-Eval verifies if ground-truth explanations are instructive to the GNN model. In addition, it shows that many popular methods, including gradient-based methods, produce explanations that are not better than a random designation of edges as important subgraphs, challenging the findings of current works in the area. Results with GInX-Eval are consistent across multiple datasets and align with human evaluation.
    Non-parametric Conditional Independence Testing for Mixed Continuous-Categorical Variables: A Novel Method and Numerical Evaluation. (arXiv:2310.11132v2 [cs.LG] UPDATED)
    Conditional independence testing (CIT) is a common task in machine learning, e.g., for variable selection, and a main component of constraint-based causal discovery. While most current CIT approaches assume that all variables are numerical or all variables are categorical, many real-world applications involve mixed-type datasets that include numerical and categorical variables. Non-parametric CIT can be conducted using conditional mutual information (CMI) estimators combined with a local permutation scheme. Recently, two novel CMI estimators for mixed-type datasets based on k-nearest-neighbors (k-NN) have been proposed. As with any k-NN method, these estimators rely on the definition of a distance metric. One approach computes distances by a one-hot encoding of the categorical variables, essentially treating categorical variables as discrete-numerical, while the other expresses CMI by entropy terms where the categorical variables appear as conditions only. In this work, we study these estimators and propose a variation of the former approach that does not treat categorical variables as numeric. Our numerical experiments show that our variant detects dependencies more robustly across different data distributions and preprocessing types.
    Efficient Robust Bayesian Optimization for Arbitrary Uncertain Inputs. (arXiv:2310.20145v2 [cs.LG] UPDATED)
    Bayesian Optimization (BO) is a sample-efficient optimization algorithm widely employed across various applications. In some challenging BO tasks, input uncertainty arises due to the inevitable randomness in the optimization process, such as machining errors, execution noise, or contextual variability. This uncertainty shifts the input away from the intended value before evaluation, resulting in significant performance fluctuations in the final result. In this paper, we introduce a novel robust Bayesian Optimization algorithm, AIRBO, which can effectively identify a robust optimum that performs consistently well under arbitrary input uncertainty. Our method directly models the uncertain inputs of arbitrary distributions by empowering the Gaussian Process with the Maximum Mean Discrepancy (MMD) and further accelerates the posterior inference via Nystrom approximation. A rigorous theoretical regret bound is established under MMD estimation error, and extensive experiments on synthetic functions and real problems demonstrate that our approach can handle various input uncertainties and achieve state-of-the-art performance.
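    To make the MMD ingredient concrete, here is a minimal sketch of how the squared MMD between two sets of input samples can be estimated. It is not AIRBO itself: the RBF kernel, the biased V-statistic estimator, the scalar inputs, and the names `rbf` and `mmd2` are all assumptions chosen for illustration, and the Gaussian Process and Nystrom steps are left out.

```python
import math
import random


def rbf(x, y, lengthscale=1.0):
    """Gaussian (RBF) kernel between two scalars."""
    return math.exp(-((x - y) ** 2) / (2.0 * lengthscale ** 2))


def mmd2(xs, ys, lengthscale=1.0):
    """Biased (V-statistic) estimate of the squared MMD between two samples."""
    def mean_k(a, b):
        return sum(rbf(u, v, lengthscale) for u in a for v in b) / (len(a) * len(b))
    return mean_k(xs, xs) + mean_k(ys, ys) - 2.0 * mean_k(xs, ys)


rng = random.Random(0)
# Two noisy realizations of the same intended input, and one of a distant input.
near_a = [rng.gauss(0.0, 0.1) for _ in range(200)]
near_b = [rng.gauss(0.0, 0.1) for _ in range(200)]
far = [rng.gauss(2.0, 0.1) for _ in range(200)]

same = mmd2(near_a, near_a)    # a sample compared with itself
close = mmd2(near_a, near_b)   # same input distribution
apart = mmd2(near_a, far)      # different input distributions
```

    The estimate is small when two uncertain inputs share a distribution and grows as the distributions separate, which is the property that lets a distance of this kind compare inputs that are distributions rather than points.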
    Adapt Your Teacher: Improving Knowledge Distillation for Exemplar-free Continual Learning. (arXiv:2308.09544v3 [cs.LG] UPDATED)
    In this work, we investigate exemplar-free class incremental learning (CIL) with knowledge distillation (KD) as a regularization strategy, aiming to prevent forgetting. KD-based methods are successfully used in CIL, but they often struggle to regularize the model without access to exemplars of the training data from previous tasks. Our analysis reveals that this issue originates from substantial representation shifts in the teacher network when dealing with out-of-distribution data. This causes large errors in the KD loss component, leading to performance degradation in CIL models. Inspired by recent test-time adaptation methods, we introduce Teacher Adaptation (TA), a method that concurrently updates the teacher and the main models during incremental training. Our method seamlessly integrates with KD-based CIL approaches and allows for consistent enhancement of their performance across multiple exemplar-free CIL benchmarks. The source code for our method is available at https://github.com/fszatkowski/cl-teacher-adaptation.
    AMAGO: Scalable In-Context Reinforcement Learning for Adaptive Agents. (arXiv:2310.09971v2 [cs.LG] UPDATED)
    We introduce AMAGO, an in-context Reinforcement Learning (RL) agent that uses sequence models to tackle the challenges of generalization, long-term memory, and meta-learning. Recent works have shown that off-policy learning can make in-context RL with recurrent policies viable. Nonetheless, these approaches require extensive tuning and limit scalability by creating key bottlenecks in agents' memory capacity, planning horizon, and model size. AMAGO revisits and redesigns the off-policy in-context approach to successfully train long-sequence Transformers over entire rollouts in parallel with end-to-end RL. Our agent is uniquely scalable and applicable to a wide range of problems. We demonstrate its strong performance empirically in meta-RL and long-term memory domains. AMAGO's focus on sparse rewards and off-policy data also allows in-context learning to extend to goal-conditioned problems with challenging exploration. When combined with a novel hindsight relabeling scheme, AMAGO can solve a previously difficult category of open-world domains, where agents complete many possible instructions in procedurally generated environments. We evaluate our agent on three goal-conditioned domains and study how its individual improvements connect to create a generalist policy.
    On existence, uniqueness and scalability of adversarial robustness measures for AI classifiers. (arXiv:2310.14421v2 [stat.ML] UPDATED)
    Simply-verifiable mathematical conditions for existence, uniqueness and explicit analytical computation of minimal adversarial paths (MAP) and minimal adversarial distances (MAD) for (locally) uniquely-invertible classifiers, for generalized linear models (GLM), and for entropic AI (EAI) are formulated and proven. Practical computation of MAP and MAD, their comparison and interpretations for various classes of AI tools (for neuronal networks, boosted random forests, GLM and EAI) are demonstrated on the common synthetic benchmarks: on a double Swiss roll spiral and its extensions, as well as on the two biomedical data problems (for the health insurance claim predictions, and for the heart attack lethality classification). On biomedical applications it is demonstrated how MAP provides unique minimal patient-specific risk-mitigating interventions in the predefined subsets of accessible control variables.
    Early detection of inflammatory arthritis to improve referrals using multimodal machine learning from blood testing, semi-structured and unstructured patient records. (arXiv:2310.19967v2 [cs.LG] UPDATED)
    Early detection of inflammatory arthritis (IA) is critical to efficient and accurate hospital referral triage for timely treatment and preventing the deterioration of the IA disease course, especially under limited healthcare resources. The manual assessment process is the most common approach in practice for the early detection of IA, but it is extremely labor-intensive and inefficient. A large amount of clinical information needs to be assessed for every referral from General Practice (GP) to the hospitals. Machine learning shows great potential in automating repetitive assessment tasks and providing decision support for the early detection of IA. However, most machine learning-based methods for IA detection rely on blood testing results. But in practice, blood testing data is not always available at the point of referrals, so we need methods to leverage multimodal data such as semi-structured and unstructured data for early detection of IA. In this research, we present fusion and ensemble learning-based methods using multimodal data to assist decision-making in the early detection of IA, and a conformal prediction-based method to quantify the uncertainty of the prediction and detect any unreliable predictions. To the best of our knowledge, our study is the first attempt to utilize multimodal data to support the early detection of IA from GP referrals.
    De Novo Drug Design with Joint Transformers. (arXiv:2310.02066v2 [cs.LG] UPDATED)
    De novo drug design requires simultaneously generating novel molecules outside of training data and predicting their target properties, making it a hard task for generative models. To address this, we propose Joint Transformer that combines a Transformer decoder, a Transformer encoder, and a predictor in a joint generative model with shared weights. We show that training the model with a penalized log-likelihood objective results in state-of-the-art performance in molecule generation, while decreasing the prediction error on newly sampled molecules, as compared to a fine-tuned decoder-only Transformer, by 42%. Finally, we propose a probabilistic black-box optimization algorithm that employs Joint Transformer to generate novel molecules with improved target properties, as compared to the training data, outperforming other SMILES-based optimization methods in de novo drug design.
    Causality and Independence Enhancement for Biased Node Classification. (arXiv:2310.09586v2 [cs.LG] UPDATED)
    Most existing methods that address out-of-distribution (OOD) generalization for node classification on graphs primarily focus on a specific type of data bias, such as label selection bias or structural bias. However, anticipating the type of bias in advance is extremely challenging, and designing models solely for one specific type may not necessarily improve overall generalization performance. Moreover, limited research has focused on the impact of mixed biases, which are more prevalent and demanding in real-world scenarios. To address these limitations, we propose a novel Causality and Independence Enhancement (CIE) framework, applicable to various graph neural networks (GNNs). Our approach estimates causal and spurious features at the node representation level and mitigates the influence of spurious correlations through the backdoor adjustment. Meanwhile, an independence constraint is introduced to improve the discriminability and stability of causal and spurious features in complex biased environments. Essentially, CIE eliminates different types of data biases from a unified perspective, without the need to design separate methods for each bias as before. To evaluate the performance under specific types of data biases, mixed biases, and low-resource scenarios, we conducted comprehensive experiments on five publicly available datasets. Experimental results demonstrate that our approach CIE not only significantly enhances the performance of GNNs but also outperforms state-of-the-art debiased node classification methods.
    Online covariance estimation for stochastic gradient descent under Markovian sampling. (arXiv:2308.01481v2 [math.ST] UPDATED)
    We investigate the online overlapping batch-means covariance estimator for Stochastic Gradient Descent (SGD) under Markovian sampling. Convergence rates of order $O\big(\sqrt{d}\,n^{-1/8}(\log n)^{1/4}\big)$ and $O\big(\sqrt{d}\,n^{-1/8}\big)$ are established under state-dependent and state-independent Markovian sampling, respectively, where $d$ is the dimensionality and $n$ denotes the number of observations or SGD iterations. These rates match the best-known convergence rate for independent and identically distributed (i.i.d.) data. Our analysis overcomes significant challenges that arise due to Markovian sampling, leading to the introduction of additional error terms and complex dependencies between the blocks of the batch-means covariance estimator. Moreover, we establish the convergence rate for the first four moments of the $\ell_2$ norm of the error of SGD dynamics under state-dependent Markovian data, which holds potential interest as an independent result. Numerical illustrations provide confidence intervals for SGD in linear and logistic regression models under Markovian sampling. Additionally, our method is applied to strategic classification with logistic regression, where adversaries adaptively modify features during training to affect target class classification.
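    For readers unfamiliar with batch-means estimators, the sketch below shows the classical overlapping batch-means estimate of a long-run variance. It is a simplification and not the estimator analyzed in the paper: it is offline, uses one fixed batch size, and handles a scalar sequence rather than $d$-dimensional SGD iterates. The AR(1) chain stands in for Markovian dependence, and `obm_variance` is a hypothetical name.

```python
import random


def obm_variance(xs, batch):
    """Classical overlapping batch-means estimate of the long-run variance
    of a scalar sequence, using a fixed batch size."""
    n = len(xs)
    if not 1 <= batch < n:
        raise ValueError("batch must satisfy 1 <= batch < len(xs)")
    overall = sum(xs) / n
    # Rolling sum over each of the n - batch + 1 overlapping windows.
    window = sum(xs[:batch])
    total = (window / batch - overall) ** 2
    for i in range(batch, n):
        window += xs[i] - xs[i - batch]
        total += (window / batch - overall) ** 2
    return n * batch * total / ((n - batch) * (n - batch + 1))


rng = random.Random(0)
iid = [rng.gauss(0.0, 1.0) for _ in range(20000)]

# AR(1) chain x_t = 0.5 * x_{t-1} + noise: a simple Markovian sequence whose
# long-run variance is 1 / (1 - 0.5) ** 2 = 4, above its marginal variance.
ar, x = [], 0.0
for _ in range(20000):
    x = 0.5 * x + rng.gauss(0.0, 1.0)
    ar.append(x)

iid_est = obm_variance(iid, batch=100)
ar_est = obm_variance(ar, batch=100)
```

    On the i.i.d. sequence the estimate should land near the marginal variance of 1, while on the dependent chain it should land near 4. A naive sample variance would miss that gap, which is why confidence intervals under Markovian sampling need an estimator of this kind.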
    PRE: Vision-Language Prompt Learning with Reparameterization Encoder. (arXiv:2309.07760v2 [cs.CV] UPDATED)
    Large pre-trained vision-language models such as CLIP have demonstrated great potential in zero-shot transferability to downstream tasks. However, to attain optimal performance, the manual selection of prompts is necessary to improve alignment between the downstream image distribution and the textual class descriptions. This manual prompt engineering is the major challenge for deploying such models in practice since it requires domain expertise and is extremely time-consuming. To avoid non-trivial prompt engineering, recent work Context Optimization (CoOp) introduced the concept of prompt learning to the vision domain using learnable textual tokens. While CoOp can achieve substantial improvements over manual prompts, its learned context generalizes poorly to wider unseen classes within the same dataset. In this work, we present Prompt Learning with Reparameterization Encoder (PRE) - a simple and efficient method that enhances the generalization ability of the learnable prompt to unseen classes while maintaining the capacity to learn Base classes. Instead of directly optimizing the prompts, PRE employs a prompt encoder to reparameterize the input prompt embeddings, enhancing the exploration of task-specific knowledge from few-shot samples. Experiments and extensive ablation studies on 8 benchmarks demonstrate that our approach is an efficient method for prompt learning. Specifically, PRE achieves a notable enhancement of 5.60% in average accuracy on New classes and 3% in Harmonic mean compared to CoOp in the 16-shot setting, all achieved within a good training time.
    AI Increases Global Access to Reliable Flood Forecasts. (arXiv:2307.16104v4 [cs.LG] UPDATED)
    Floods are one of the most common natural disasters, with a disproportionate impact in developing countries that often lack dense streamflow gauge networks. Accurate and timely warnings are critical for mitigating flood risks, but hydrological simulation models typically must be calibrated to long data records in each watershed. Using AI, we achieve reliability in predicting extreme riverine events in ungauged watersheds at up to a 5-day lead time that is similar to or better than the reliability of nowcasts (0-day lead time) from a current state of the art global modeling system (the Copernicus Emergency Management Service Global Flood Awareness System). Additionally, we achieve accuracies over 5-year return period events that are similar to or better than current accuracies over 1-year return period events. This means that AI can provide flood warnings earlier and over larger and more impactful events in ungauged basins. The model developed in this paper was incorporated into an operational early warning system that produces publicly available (free and open) forecasts in real time in over 80 countries. This work highlights a need for increasing the availability of hydrological data to continue to improve global access to reliable flood warnings.
    SBSM-Pro: Support Bio-sequence Machine for Proteins. (arXiv:2308.10275v2 [q-bio.QM] UPDATED)
    Proteins play a pivotal role in biological systems. The use of machine learning algorithms for protein classification can assist and even guide biological experiments, offering crucial insights for biotechnological applications. We introduce the Support Bio-Sequence Machine for Proteins (SBSM-Pro), a model purpose-built for the classification of biological sequences. This model starts with raw sequences and groups amino acids based on their physicochemical properties. It incorporates sequence alignment to measure the similarities between proteins and uses a novel multiple kernel learning (MKL) approach to integrate various types of information, utilizing support vector machines for classification prediction. The results indicate that our model demonstrates commendable performance across ten datasets in terms of the identification of protein function and posttranslational modification. This research not only exemplifies state-of-the-art work in protein classification but also paves avenues for new directions in this domain, representing a beneficial endeavor in the development of platforms tailored for the classification of biological sequences. SBSM-Pro is available for access at this http URL
    DeepACO: Neural-enhanced Ant Systems for Combinatorial Optimization. (arXiv:2309.14032v2 [cs.NE] UPDATED)
    Ant Colony Optimization (ACO) is a meta-heuristic algorithm that has been successfully applied to various Combinatorial Optimization Problems (COPs). Traditionally, customizing ACO for a specific problem requires the expert design of knowledge-driven heuristics. In this paper, we propose DeepACO, a generic framework that leverages deep reinforcement learning to automate heuristic designs. DeepACO serves to strengthen the heuristic measures of existing ACO algorithms and dispense with laborious manual design in future ACO applications. As a neural-enhanced meta-heuristic, DeepACO consistently outperforms its ACO counterparts on eight COPs using a single neural architecture and a single set of hyperparameters. As a Neural Combinatorial Optimization method, DeepACO performs better than or on par with problem-specific methods on canonical routing problems. Our code is publicly available at https://github.com/henry-yeh/DeepACO.
    Towards Robust Cardiac Segmentation using Graph Convolutional Networks. (arXiv:2310.01210v4 [eess.IV] UPDATED)
    Fully automatic cardiac segmentation can be a fast and reproducible method to extract clinical measurements from an echocardiography examination. The U-Net architecture is the current state-of-the-art deep learning architecture for medical segmentation and can segment cardiac structures in real-time with average errors comparable to inter-observer variability. However, this architecture still generates large outliers that are often anatomically incorrect. This work uses the concept of graph convolutional neural networks that predict the contour points of the structures of interest instead of labeling each pixel. We propose a graph architecture that uses two convolutional rings based on cardiac anatomy and show that this eliminates anatomically incorrect multi-structure segmentations on the publicly available CAMUS dataset. Additionally, this work contributes an ablation study on the graph convolutional architecture and an evaluation of clinical measurements on the clinical HUNT4 dataset. Finally, we propose to use the inter-model agreement of the U-Net and the graph network as a predictor of both the input and segmentation quality. We show this predictor can detect out-of-distribution and unsuitable input images in real-time. Source code is available online: https://github.com/gillesvntnu/GCN_multistructure
    Active Learning for Semantic Segmentation with Multi-class Label Query. (arXiv:2309.09319v2 [cs.CV] UPDATED)
    This paper proposes a new active learning method for semantic segmentation. The core of our method lies in a new annotation query design. It samples informative local image regions (e.g., superpixels), and for each of such regions, asks an oracle for a multi-hot vector indicating all classes existing in the region. This multi-class labeling strategy is substantially more efficient than existing ones like segmentation, polygon, and even dominant class labeling in terms of annotation time per click. However, it introduces the class ambiguity issue in training as it assigns partial labels (i.e., a set of candidate classes) to individual pixels. We thus propose a new algorithm for learning semantic segmentation while disambiguating the partial labels in two stages. In the first stage, it trains a segmentation model directly with the partial labels through two new loss functions motivated by partial label learning and multiple instance learning. In the second stage, it disambiguates the partial labels by generating pixel-wise pseudo labels, which are used for supervised learning of the model. Equipped with a new acquisition function dedicated to the multi-class labeling, our method outperforms previous work on Cityscapes and PASCAL VOC 2012 while spending less annotation cost. Our code and results are available at https://github.com/sehyun03/MulActSeg.
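    The multi-class label query described above can be illustrated with a short sketch. It simulates the oracle from a ground-truth label map, which is an assumption made for the example. The function names and the toy image are hypothetical, and this is not the code from the linked repository.

```python
def multi_hot_labels(label_map, region_map, num_classes):
    """For every region id, return a 0/1 vector marking each class that
    appears on at least one pixel of that region (simulated oracle answer)."""
    labels = {}
    for label_row, region_row in zip(label_map, region_map):
        for cls, region in zip(label_row, region_row):
            vec = labels.setdefault(region, [0] * num_classes)
            vec[cls] = 1
    return labels


def candidate_classes(labels, region_map):
    """Partial labels: each pixel inherits the candidate class set of its region."""
    return [[{c for c, on in enumerate(labels[r]) if on} for r in row]
            for row in region_map]


# Toy 2x4 image with 3 classes and two regions (hypothetical values).
label_map = [[0, 0, 1, 2],
             [0, 1, 1, 2]]
region_map = [[0, 0, 1, 1],
              [0, 0, 1, 1]]
labels = multi_hot_labels(label_map, region_map, num_classes=3)
partial = candidate_classes(labels, region_map)
```

    Every pixel in region 0 ends up with the candidate set {0, 1} even though most of them belong to class 0. This is the class ambiguity that the two-stage training in the abstract is meant to resolve.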
    MultiModN- Multimodal, Multi-Task, Interpretable Modular Networks. (arXiv:2309.14118v2 [cs.LG] UPDATED)
    Predicting multiple real-world tasks in a single model often requires a particularly diverse feature space. Multimodal (MM) models aim to extract the synergistic predictive potential of multiple data types to create a shared feature space with aligned semantic meaning across inputs of drastically varying sizes (i.e. images, text, sound). Most current MM architectures fuse these representations in parallel, which not only limits their interpretability but also creates a dependency on modality availability. We present MultiModN, a multimodal, modular network that fuses latent representations in a sequence of any number, combination, or type of modality while providing granular real-time predictive feedback on any number or combination of predictive tasks. MultiModN's composable pipeline is interpretable-by-design, as well as innately multi-task and robust to the fundamental issue of biased missingness. We perform four experiments on several benchmark MM datasets across 10 real-world tasks (predicting medical diagnoses, academic performance, and weather), and show that MultiModN's sequential MM fusion does not compromise performance compared with a baseline of parallel fusion. By simulating the challenging bias of missing not-at-random (MNAR), this work shows that, contrary to MultiModN, parallel fusion baselines erroneously learn MNAR and suffer catastrophic failure when faced with different patterns of MNAR at inference. To the best of our knowledge, this is the first inherently MNAR-resistant approach to MM modeling. In conclusion, MultiModN provides granular insights, robustness, and flexibility without compromising performance.
    RTDK-BO: High Dimensional Bayesian Optimization with Reinforced Transformer Deep kernels. (arXiv:2310.03912v4 [cs.LG] UPDATED)
    Bayesian Optimization (BO), guided by Gaussian process (GP) surrogates, has proven to be an invaluable technique for efficient, high-dimensional, black-box optimization, a critical problem inherent to many applications such as industrial design and scientific computing. Recent contributions have introduced reinforcement learning (RL) to improve the optimization performance on both single function optimization and few-shot multi-objective optimization. However, even few-shot techniques fail to exploit similarities shared between closely related objectives. In this paper, we combine recent developments in Deep Kernel Learning (DKL) and attention-based Transformer models to improve the modeling powers of GP surrogates with meta-learning. We propose a novel method for improving meta-learning BO surrogates by incorporating attention mechanisms into DKL, empowering the surrogates to adapt to contextual information gathered during the BO process. We combine this Transformer Deep Kernel with a learned acquisition function trained with continuous Soft Actor-Critic Reinforcement Learning to aid in exploration. This Reinforced Transformer Deep Kernel (RTDK-BO) approach yields state-of-the-art results in continuous high-dimensional optimization problems.
    A Nonlinear Method for time series forecasting using VMD-GARCH-LSTM model. (arXiv:2310.08812v2 [stat.ME] UPDATED)
    Time series forecasting represents a significant and challenging task across various fields. Recently, methods based on mode decomposition have dominated the forecasting of complex time series because of the advantages of capturing local characteristics and extracting intrinsic modes from data. Unfortunately, most models fail to capture the implied volatilities that contain significant information. To enhance the forecasting of current, rapidly evolving, and volatile time series, we propose a novel decomposition-ensemble paradigm, the VMD-LSTM-GARCH model. The Variational Mode Decomposition algorithm is employed to decompose the time series into K sub-modes. Subsequently, the GARCH model extracts the volatility information from these sub-modes, which serve as the input for the LSTM. The numerical and volatility information of each sub-mode is utilized to train a Long Short-Term Memory network. This network predicts the sub-mode, and then we aggregate the predictions from all sub-modes to produce the output. By integrating econometric and artificial intelligence methods, and taking into account both the numerical and volatility information of the time series, our proposed model demonstrates superior performance in time series forecasting, as evidenced by the significant decrease in MSE, RMSE, and MAPE in our comparative experimental results.
    An Online Multiple Kernel Parallelizable Learning Scheme. (arXiv:2308.10101v2 [cs.LG] UPDATED)
    The performance of reproducing kernel Hilbert space-based methods is known to be sensitive to the choice of the reproducing kernel. Choosing an adequate reproducing kernel can be challenging and computationally demanding, especially in data-rich tasks without prior information about the solution domain. In this paper, we propose a learning scheme that scalably combines several single kernel-based online methods to reduce the kernel-selection bias. The proposed learning scheme applies to any task formulated as a regularized empirical risk minimization convex problem. More specifically, our learning scheme is based on a multi-kernel learning formulation that can be applied to widen any single-kernel solution space, thus increasing the possibility of finding higher-performance solutions. In addition, it is parallelizable, allowing for the distribution of the computational load across different computing units. We show experimentally that the proposed learning scheme outperforms the combined single-kernel online methods separately in terms of the cumulative regularized least squares cost metric.
    Optimal data pooling for shared learning in maintenance operations. (arXiv:2308.12670v2 [cs.LG] UPDATED)
    We study optimal data pooling for shared learning in two common maintenance operations: condition-based maintenance and spare parts management. We consider a set of systems subject to Poisson input -- the degradation or demand process -- that are coupled through an a-priori unknown rate. Decision problems involving these systems are high-dimensional Markov decision processes (MDPs) and hence notoriously difficult to solve. We present a decomposition result that reduces such an MDP to two-dimensional MDPs, enabling structural analyses and computations. Leveraging this decomposition, we (i) demonstrate that pooling data can lead to significant cost reductions compared to not pooling, and (ii) show that the optimal policy for the condition-based maintenance problem is a control limit policy, while for the spare parts management problem, it is an order-up-to level policy, both dependent on the pooled data.
    Spatial-frequency channels, shape bias, and adversarial robustness. (arXiv:2309.13190v2 [cs.LG] UPDATED)
    What spatial frequency information do humans and neural networks use to recognize objects? In neuroscience, critical band masking is an established tool that can reveal the frequency-selective filters used for object recognition. Critical band masking measures the sensitivity of recognition performance to noise added at each spatial frequency. Existing critical band masking studies show that humans recognize periodic patterns (gratings) and letters by means of a spatial-frequency filter (or "channel") that has a frequency bandwidth of one octave (doubling of frequency). Here, we introduce critical band masking as a task for network-human comparison and test 14 humans and 76 neural networks on 16-way ImageNet categorization in the presence of narrowband noise. We find that humans recognize objects in natural images using the same one-octave-wide channel that they use for letters and gratings, making it a canonical feature of human object recognition. Unlike humans, the neural network channel is very broad, 2-4 times wider than the human channel. Thus, noise at certain high and low frequencies will impair network performance and spare human performance. Adversarial and augmented-image training are commonly used to increase network robustness and shape bias. Does this training align network and human object recognition channels? Three network channel properties (bandwidth, center frequency, peak noise sensitivity) correlate strongly with shape bias (51% variance explained) and robustness of adversarially-trained networks (66% variance explained). Adversarial training increases robustness but expands the channel bandwidth even further beyond the human bandwidth. Thus, critical band masking reveals that the network channel is more than twice as wide as the human channel, and that adversarial training only makes it worse. Networks with narrower channels might be more robust.
    DEDUCE: Multi-head attention decoupled contrastive learning to discover cancer subtypes based on multi-omics data. (arXiv:2307.04075v2 [cs.LG] UPDATED)
    Due to the high heterogeneity and clinical characteristics of cancer, there are significant differences in multi-omics data and clinical features among subtypes of different cancers. Therefore, the identification and discovery of cancer subtypes are crucial for the diagnosis, treatment, and prognosis of cancer. In this study, we propose a generalization framework based on attention mechanisms for unsupervised contrastive learning to analyze cancer multi-omics data for the identification and characterization of cancer subtypes. The framework contains a symmetric unsupervised multi-head attention encoder, which can deeply extract contextual features and long-range dependencies of multi-omics data, reducing the impact of noise in multi-omics data. Importantly, the proposed framework includes a decoupled contrastive learning model (DEDUCE) based on a multi-head attention mechanism to learn multi-omics data features, perform clustering, and identify cancer subtypes. This method clusters subtypes by calculating the similarity between samples in the feature space and sample space of multi-omics data. The basic idea is to decouple different attributes of multi-omics data features and learn them as contrasting terms. A contrastive loss function is constructed to measure the difference between positive examples and negative examples, and this difference is minimized, thereby encouraging the model to learn better feature representations. The DEDUCE model is evaluated in large-scale experiments on simulated multi-omics data sets, single-cell multi-omics data sets and cancer multi-omics data sets, and its results are better than those of 10 deep learning models. Finally, we used the DEDUCE model to reveal six cancer subtypes of AML. By analyzing GO functional enrichment, subtype-specific biological functions and GSEA of AML,
    Low Tensor Rank Learning of Neural Dynamics. (arXiv:2308.11567v2 [q-bio.NC] UPDATED)
    Learning relies on coordinated synaptic changes in recurrently connected populations of neurons. Therefore, understanding the collective evolution of synaptic connectivity over learning is a key challenge in neuroscience and machine learning. In particular, recent work has shown that the weight matrices of task-trained RNNs are typically low rank, but how this low rank structure unfolds over learning is unknown. To address this, we investigate the rank of the 3-tensor formed by the weight matrices throughout learning. By fitting RNNs of varying rank to large-scale neural recordings during a motor learning task, we find that the inferred weights are low-tensor-rank and therefore evolve over a fixed low-dimensional subspace throughout the entire course of learning. We next validate the observation of low-tensor-rank learning on an RNN trained to solve the same task. Finally, we present a set of mathematical results bounding the matrix and tensor ranks of gradient descent learning dynamics which show that low-tensor-rank weights emerge naturally in RNNs trained to solve low-dimensional tasks. Taken together, our findings provide insight on the evolution of population connectivity over learning in both biological and artificial neural networks, and enable reverse engineering of learning-induced changes in recurrent dynamics from large-scale neural recordings.
    HINT: Healthy Influential-Noise based Training to Defend against Data Poisoning Attacks. (arXiv:2309.08549v2 [cs.LG] UPDATED)
    While numerous defense methods have been proposed to prohibit potential poisoning attacks from untrusted data sources, most research works only defend against specific attacks, which leaves many avenues for an adversary to exploit. In this work, we propose an efficient and robust training approach to defend against data poisoning attacks based on influence functions, named Healthy Influential-Noise based Training. Using influence functions, we craft healthy noise that helps to harden the classification model against poisoning attacks without significantly affecting the generalization ability on test data. In addition, our method can perform effectively when only a subset of the training data is modified, instead of the current method of adding noise to all examples that has been used in several previous works. We conduct comprehensive evaluations over two image datasets with state-of-the-art poisoning attacks under different realistic attack scenarios. Our empirical results show that HINT can efficiently protect deep learning models against the effect of both untargeted and targeted poisoning attacks.
    PyDCM: Custom Data Center Models with Reinforcement Learning for Sustainability. (arXiv:2310.03906v5 [cs.LG] UPDATED)
    The increasing global emphasis on sustainability and reducing carbon emissions is pushing governments and corporations to rethink their approach to data center design and operation. Given their high energy consumption and exponentially large computational workloads, data centers are prime candidates for optimizing power consumption, especially in areas such as cooling and IT energy usage. A significant challenge in this pursuit is the lack of a configurable and scalable thermal data center model that offers an end-to-end pipeline. Data centers consist of multiple IT components whose geometric configuration and heat dissipation make thermal modeling difficult. This paper presents PyDCM, a customizable Data Center Model implemented in Python, that allows users to create unique configurations of IT equipment with custom server specifications and geometric arrangements of IT cabinets. The use of vectorized thermal calculations makes PyDCM orders of magnitude faster (30 times) than current Energy Plus modeling implementations and scales sublinearly with the number of CPUs. Also, PyDCM enables the use of Deep Reinforcement Learning via the Gymnasium wrapper to optimize data center cooling and offers a user-friendly platform for testing various data center design prototypes.
    Adaptive Linear Estimating Equations. (arXiv:2307.07320v2 [math.ST] UPDATED)
    Sequential data collection has emerged as a widely adopted technique for enhancing the efficiency of data gathering processes. Despite its advantages, such a data collection mechanism often introduces complexities to the statistical inference procedure. For instance, the ordinary least squares (OLS) estimator in an adaptive linear regression model can exhibit non-normal asymptotic behavior, posing challenges for accurate inference and interpretation. In this paper, we propose a general method for constructing a debiased estimator which remedies this issue. It makes use of the idea of adaptive linear estimating equations, and we establish theoretical guarantees of asymptotic normality, supplemented by discussions on achieving near-optimal asymptotic variance. A salient feature of our estimator is that in the context of multi-armed bandits, it retains the non-asymptotic performance of the least squares estimator while obtaining the asymptotic normality property. Consequently, this work helps connect two fruitful paradigms of adaptive inference: a) non-asymptotic inference using concentration inequalities and b) asymptotic inference via asymptotic normality.
    On the Computational Entanglement of Distant Features in Adversarial Machine Learning. (arXiv:2309.15669v3 [cs.LG] UPDATED)
    Adversarial examples in machine learning have emerged as a focal point of research due to their remarkable ability to deceive models with seemingly inconspicuous input perturbations, potentially resulting in severe consequences. In this study, we embark on a comprehensive exploration of adversarial machine learning models, shedding light on their intrinsic complexity and interpretability. Our investigation reveals intriguing links between machine learning model complexity and Einstein's theory of special relativity, all through the lens of entanglement. While our work does not primarily center on quantum entanglement, we instead define the entanglement correlations we have discovered to be computational, and demonstrate that distant feature samples can be entangled, strongly resembling entanglement correlation in the quantum realm. This revelation bestows fresh insights for understanding the phenomenon of emergent adversarial examples in modern machine learning, potentially paving the way for more robust and interpretable models in this rapidly evolving field.
    Adaptive Data Analysis in a Balanced Adversarial Model. (arXiv:2305.15452v2 [cs.LG] UPDATED)
    In adaptive data analysis, a mechanism gets $n$ i.i.d. samples from an unknown distribution $D$, and is required to provide accurate estimations to a sequence of adaptively chosen statistical queries with respect to $D$. Hardt and Ullman (FOCS 2014) and Steinke and Ullman (COLT 2015) showed that in general, it is computationally hard to answer more than $\Theta(n^2)$ adaptive queries, assuming the existence of one-way functions. However, these negative results strongly rely on an adversarial model that significantly advantages the adversarial analyst over the mechanism, as the analyst, who chooses the adaptive queries, also chooses the underlying distribution $D$. This imbalance raises questions with respect to the applicability of the obtained hardness results -- an analyst who has complete knowledge of the underlying distribution $D$ would have little need, if at all, to issue statistical queries to a mechanism which only holds a finite number of samples from $D$. We consider more restricted adversaries, called \emph{balanced}, where each such adversary consists of two separated algorithms: The \emph{sampler} who is the entity that chooses the distribution and provides the samples to the mechanism, and the \emph{analyst} who chooses the adaptive queries, but has no prior knowledge of the underlying distribution (and hence has no a priori advantage with respect to the mechanism). We improve the quality of previous lower bounds by revisiting them using an efficient \emph{balanced} adversary, under standard public-key cryptography assumptions. We show that these stronger hardness assumptions are unavoidable in the sense that any computationally bounded \emph{balanced} adversary that has the structure of all known attacks, implies the existence of public-key cryptography.
    Guiding The Last Layer in Federated Learning with Pre-Trained Models. (arXiv:2306.03937v2 [cs.LG] UPDATED)
    Federated Learning (FL) is an emerging paradigm that allows a model to be trained across a number of participants without sharing data. Recent works have begun to consider the effects of using pre-trained models as an initialization point for existing FL algorithms; however, these approaches ignore the vast body of efficient transfer learning literature from the centralized learning setting. Here we revisit the problem of FL from a pre-trained model considered in prior work and expand it to a set of computer vision transfer learning problems. We first observe that simply fitting a linear classification head can be efficient and effective in many cases. We then show that in the FL setting, fitting a classifier using the Nearest Class Means (NCM) can be done exactly and orders of magnitude more efficiently than existing proposals, while obtaining strong performance. Finally, we demonstrate that using a two-phase approach of obtaining the classifier and then fine-tuning the model can yield rapid convergence and improved generalization in the federated setting. We demonstrate the potential our method has to reduce communication and compute costs while achieving better model performance.
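The claim that a Nearest Class Means classifier can be fitted exactly in the federated setting can be illustrated with a minimal sketch. It assumes (this is not stated in the abstract) that each client sends only per-class feature sums and counts, computed on features from a frozen pre-trained extractor, and that the server divides the aggregated sums by the aggregated counts. The function names `client_statistics`, `aggregate_class_means`, and `predict` are hypothetical.

```python
from collections import defaultdict


def client_statistics(features, labels):
    """Per-class feature sums and counts, computed locally on one client."""
    sums = {}
    counts = defaultdict(int)
    for x, y in zip(features, labels):
        if y not in sums:
            sums[y] = [0.0] * len(x)
        sums[y] = [s + v for s, v in zip(sums[y], x)]
        counts[y] += 1
    return sums, dict(counts)


def aggregate_class_means(client_stats):
    """Server side: add up client sums and counts, then divide once."""
    total_sums, total_counts = {}, {}
    for sums, counts in client_stats:
        for y, s in sums.items():
            if y not in total_sums:
                total_sums[y] = [0.0] * len(s)
                total_counts[y] = 0
            total_sums[y] = [a + b for a, b in zip(total_sums[y], s)]
            total_counts[y] += counts[y]
    return {y: [v / total_counts[y] for v in s] for y, s in total_sums.items()}


def predict(class_means, x):
    """Assign x to the class whose mean is closest in squared Euclidean distance."""
    def sqdist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    return min(class_means, key=lambda y: sqdist(class_means[y], x))


# Two hypothetical clients holding disjoint data.
stats_a = client_statistics([[0.0, 0.0], [2.0, 0.0]], [0, 0])
stats_b = client_statistics([[4.0, 0.0], [10.0, 10.0]], [0, 1])
means = aggregate_class_means([stats_a, stats_b])
```

Because sums and counts add across clients, the aggregated means equal those computed on the pooled data, and under this assumed protocol a single communication round suffices.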
    Explainable Representation Learning of Small Quantum States. (arXiv:2306.05694v3 [quant-ph] UPDATED)
    Unsupervised machine learning models build an internal representation of their training data without the need for explicit human guidance or feature engineering. This learned representation provides insights into which features of the data are relevant for the task at hand. In the context of quantum physics, training models to describe quantum states without human intervention offers a promising approach to gaining insight into how machines represent complex quantum states. The ability to interpret the learned representation may offer a new perspective on non-trivial features of quantum systems and their efficient representation. We train a generative model on two-qubit density matrices generated by a parameterized quantum circuit. In a series of computational experiments, we investigate the learned representation of the model and its internal understanding of the data. We observe that the model learns an interpretable representation which relates the quantum states to their underlying entanglement characteristics. In particular, our results demonstrate that the latent representation of the model is directly correlated with the entanglement measure concurrence. The insights from this study represent proof of concept towards interpretable machine learning of quantum states. Our approach offers insight into how machines learn to represent small-scale quantum systems autonomously.
    Energy-Based Cross Attention for Bayesian Context Update in Text-to-Image Diffusion Models. (arXiv:2306.09869v3 [cs.CV] UPDATED)
    Despite the remarkable performance of text-to-image diffusion models in image generation tasks, recent studies have raised the issue that generated images sometimes cannot capture the intended semantic contents of the text prompts, a phenomenon often called semantic misalignment. To address this, here we present a novel energy-based model (EBM) framework for adaptive context control by modeling the posterior of context vectors. Specifically, we first formulate EBMs of latent image representations and text embeddings in each cross-attention layer of the denoising autoencoder. Then, we obtain the gradient of the log posterior of context vectors, which can be updated and transferred to the subsequent cross-attention layer, thereby implicitly minimizing a nested hierarchy of energy functions. Our latent EBMs further allow zero-shot compositional generation as a linear combination of cross-attention outputs from different contexts. Using extensive experiments, we demonstrate that the proposed method is highly effective in handling various image generation tasks, including multi-concept generation, text-guided image inpainting, and real and synthetic image editing. Code: https://github.com/EnergyAttention/Energy-Based-CrossAttention.
    Towards Symmetry-Aware Generation of Periodic Materials. (arXiv:2307.02707v2 [cs.LG] UPDATED)
    We consider the problem of generating periodic materials with deep models. While symmetry-aware molecule generation has been studied extensively, periodic materials possess different symmetries, which have not been completely captured by existing methods. In this work, we propose SyMat, a novel material generation approach that can capture physical symmetries of periodic material structures. SyMat generates atom types and lattices of materials through generating atom type sets, lattice lengths and lattice angles with a variational auto-encoder model. In addition, SyMat employs a score-based diffusion model to generate atom coordinates of materials, in which a novel symmetry-aware probabilistic model is used in the coordinate diffusion process. We show that SyMat is theoretically invariant to all symmetry transformations on materials and demonstrate that SyMat achieves promising performance on random generation and property optimization tasks. Our code is publicly available as part of the AIRS library (https://github.com/divelab/AIRS).
    Enabling Efficient, Reliable Real-World Reinforcement Learning with Approximate Physics-Based Models. (arXiv:2307.08168v2 [cs.LG] UPDATED)
    We focus on developing efficient and reliable policy optimization strategies for robot learning with real-world data. In recent years, policy gradient methods have emerged as a promising paradigm for training control policies in simulation. However, these approaches often remain too data inefficient or unreliable to train on real robotic hardware. In this paper we introduce a novel policy gradient-based policy optimization framework which systematically leverages a (possibly highly simplified) first-principles model and enables learning precise control policies with limited amounts of real-world data. Our approach $1)$ uses the derivatives of the model to produce sample-efficient estimates of the policy gradient and $2)$ uses the model to design a low-level tracking controller, which is embedded in the policy class. Theoretical analysis provides insight into how the presence of this feedback controller overcomes key limitations of stand-alone policy gradient methods, while hardware experiments with a small car and quadruped demonstrate that our approach can learn precise control strategies reliably and with only minutes of real-world data.
    Flooding with Absorption: An Efficient Protocol for Heterogeneous Bandits over Complex Networks. (arXiv:2303.05445v3 [cs.LG] UPDATED)
    Multi-armed bandits are extensively used to model sequential decision-making, making them ubiquitous in many real-life applications such as online recommender systems and wireless networking. We consider a multi-agent setting where each agent solves their own bandit instance endowed with a different set of arms. Their goal is to minimize their group regret while collaborating via some communication protocol over a given network. Previous literature on this problem only considered arm heterogeneity and networked agents separately. In this work, we introduce a setting that encompasses both features. For this novel setting, we first provide a rigorous regret analysis for a standard flooding protocol combined with the classic UCB policy. Then, to mitigate the issue of high communication costs incurred by flooding in complex networks, we propose a new protocol called Flooding with Absorption (FwA). We provide a theoretical analysis of the resulting regret bound and discuss the advantages of using FwA over flooding. Lastly, we experimentally verify on various scenarios, including dynamic networks, that FwA leads to significantly lower communication costs despite minimal regret performance loss compared to other network protocols.
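For readers unfamiliar with the classic UCB policy mentioned above, the following is a minimal single-agent sketch. It assumes the common UCB1 index (empirical mean plus sqrt(2 ln t / n)); the abstract does not specify the exact constant, and the sketch does not implement flooding, FwA, or any multi-agent communication. The names `ucb_index`, `select_arm`, and `run_ucb` are hypothetical.

```python
import math


def ucb_index(mean_reward, pulls, t):
    """UCB1 index: empirical mean plus an exploration bonus."""
    if pulls == 0:
        return float("inf")  # force every arm to be tried once
    return mean_reward + math.sqrt(2.0 * math.log(t) / pulls)


def select_arm(means, pulls, t):
    """Pick the arm with the largest index (first arm wins ties)."""
    indices = [ucb_index(m, n, t) for m, n in zip(means, pulls)]
    return max(range(len(indices)), key=indices.__getitem__)


def run_ucb(reward_fn, num_arms, rounds):
    """Play `rounds` rounds; reward_fn(arm) returns the observed reward."""
    means = [0.0] * num_arms
    pulls = [0] * num_arms
    for t in range(1, rounds + 1):
        arm = select_arm(means, pulls, t)
        reward = reward_fn(arm)
        pulls[arm] += 1
        means[arm] += (reward - means[arm]) / pulls[arm]  # running average
    return means, pulls


# Toy instance with deterministic rewards: arm 1 is better than arm 0.
means, pulls = run_ucb(lambda arm: [0.2, 0.8][arm], num_arms=2, rounds=1000)
```

In the networked setting of the abstract, each agent would additionally update its statistics from messages received from other agents; that part is deliberately left out here.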
    For SALE: State-Action Representation Learning for Deep Reinforcement Learning. (arXiv:2306.02451v2 [cs.LG] UPDATED)
    In the field of reinforcement learning (RL), representation learning is a proven tool for complex image-based tasks, but is often overlooked for environments with low-level states, such as physical control problems. This paper introduces SALE, a novel approach for learning embeddings that model the nuanced interaction between state and action, enabling effective representation learning from low-level states. We extensively study the design space of these embeddings and highlight important design considerations. We integrate SALE and an adaptation of checkpoints for RL into TD3 to form the TD7 algorithm, which significantly outperforms existing continuous control algorithms. On OpenAI gym benchmark tasks, TD7 has an average performance gain of 276.7% and 50.7% over TD3 at 300k and 5M time steps, respectively, and works in both the online and offline settings.
    A Theory for Emergence of Complex Skills in Language Models. (arXiv:2307.15936v2 [cs.LG] UPDATED)
    A major driver of AI products today is the fact that new skills emerge in language models when their parameter set and training corpora are scaled up. This phenomenon is poorly understood, and a mechanistic explanation via mathematical analysis of gradient-based training seems difficult. The current paper takes a different approach, analysing emergence using the famous (and empirical) Scaling Laws of LLMs and a simple statistical framework. Contributions include: (a) A statistical framework that relates cross-entropy loss of LLMs to competence on the basic skills that underlie language tasks. (b) Mathematical analysis showing that the Scaling Laws imply a strong form of inductive bias that allows the pre-trained model to learn very efficiently. We informally call this {\em slingshot generalization} since naively viewed it appears to give competence levels at skills that violate usual generalization theory. (c) A key example of slingshot generalization, that competence at executing tasks involving $k$-tuples of skills emerges essentially at the same scaling and same rate as competence on the elementary skills themselves.
    PGraphDTA: Improving Drug Target Interaction Prediction using Protein Language Models and Contact Maps. (arXiv:2310.04017v2 [cs.LG] UPDATED)
    Developing and discovering new drugs is a complex and resource-intensive endeavor that often involves substantial costs, time investment, and safety concerns. A key aspect of drug discovery involves identifying novel drug-target (DT) interactions. Existing computational methods for predicting DT interactions have primarily focused on binary classification tasks, aiming to determine whether a DT pair interacts or not. However, protein-ligand interactions exhibit a continuum of binding strengths, known as binding affinity, presenting a persistent challenge for accurate prediction. In this study, we investigate various techniques employed in Drug Target Interaction (DTI) prediction and propose novel enhancements to improve their performance. Our approaches include the integration of Protein Language Models (PLMs) and the incorporation of Contact Map information as an inductive bias within current models. Through extensive experimentation, we demonstrate that our proposed approaches outperform the baseline models considered in this study, presenting a compelling case for further development in this direction. We anticipate that the insights gained from this work will significantly narrow the search space for potential drugs targeting specific proteins, thereby accelerating drug discovery. Code and data for PGraphDTA are available at https://github.com/Yijia-Xiao/PgraphDTA/.
    HACMan: Learning Hybrid Actor-Critic Maps for 6D Non-Prehensile Manipulation. (arXiv:2305.03942v4 [cs.RO] UPDATED)
    Manipulating objects without grasping them is an essential component of human dexterity, referred to as non-prehensile manipulation. Non-prehensile manipulation may enable more complex interactions with the objects, but also presents challenges in reasoning about gripper-object interactions. In this work, we introduce Hybrid Actor-Critic Maps for Manipulation (HACMan), a reinforcement learning approach for 6D non-prehensile manipulation of objects using point cloud observations. HACMan proposes a temporally-abstracted and spatially-grounded object-centric action representation that consists of selecting a contact location from the object point cloud and a set of motion parameters describing how the robot will move after making contact. We modify an existing off-policy RL algorithm to learn in this hybrid discrete-continuous action representation. We evaluate HACMan on a 6D object pose alignment task in both simulation and in the real world. On the hardest version of our task, with randomized initial poses, randomized 6D goals, and diverse object categories, our policy demonstrates strong generalization to unseen object categories without a performance drop, achieving an 89% success rate on unseen objects in simulation and 50% success rate with zero-shot transfer in the real world. Compared to alternative action representations, HACMan achieves a success rate more than three times higher than the best baseline. With zero-shot sim2real transfer, our policy can successfully manipulate unseen objects in the real world for challenging non-planar goals, using dynamic and contact-rich non-prehensile skills. Videos can be found on the project website: https://hacman-2023.github.io.
    Finding Counterfactually Optimal Action Sequences in Continuous State Spaces. (arXiv:2306.03929v2 [cs.LG] UPDATED)
    Whenever a clinician reflects on the efficacy of a sequence of treatment decisions for a patient, they may try to identify critical time steps where, had they made different decisions, the patient's health would have improved. While recent methods at the intersection of causal inference and reinforcement learning promise to aid human experts, as the clinician above, to retrospectively analyze sequential decision making processes, they have focused on environments with finitely many discrete states. However, in many practical applications, the state of the environment is inherently continuous in nature. In this paper, we aim to fill this gap. We start by formally characterizing a sequence of discrete actions and continuous states using finite horizon Markov decision processes and a broad class of bijective structural causal models. Building upon this characterization, we formalize the problem of finding counterfactually optimal action sequences and show that, in general, we cannot expect to solve it in polynomial time. Then, we develop a search method based on the $A^*$ algorithm that, under a natural form of Lipschitz continuity of the environment's dynamics, is guaranteed to return the optimal solution to the problem. Experiments on real clinical data show that our method is very efficient in practice, and it has the potential to offer interesting insights for sequential decision making tasks.
    Learning Large Graph Property Prediction via Graph Segment Training. (arXiv:2305.12322v3 [cs.LG] UPDATED)
    Learning to predict properties of large graphs is challenging because each prediction requires the knowledge of an entire graph, while the amount of memory available during training is bounded. Here we propose Graph Segment Training (GST), a general framework that utilizes a divide-and-conquer approach to allow learning large graph property prediction with a constant memory footprint. GST first divides a large graph into segments and then backpropagates through only a few segments sampled per training iteration. We refine the GST paradigm by introducing a historical embedding table to efficiently obtain embeddings for segments not sampled for backpropagation. To mitigate the staleness of historical embeddings, we design two novel techniques. First, we finetune the prediction head to fix the input distribution shift. Second, we introduce Stale Embedding Dropout to drop some stale embeddings during training to reduce bias. We evaluate our complete method GST-EFD (with all the techniques together) on two large graph property prediction benchmarks: MalNet and TpuGraphs. Our experiments show that GST-EFD is both memory-efficient and fast, while offering a slight boost on test accuracy over a typical full graph training regime.
    Active Vision Reinforcement Learning under Limited Visual Observability. (arXiv:2306.00975v2 [cs.LG] UPDATED)
    In this work, we investigate Active Vision Reinforcement Learning (ActiveVision-RL), where an embodied agent simultaneously learns an action policy for the task while also controlling its visual observations in partially observable environments. We denote the former as the motor policy and the latter as the sensory policy. For example, humans solve real-world tasks by hand manipulation (motor policy) together with eye movements (sensory policy). ActiveVision-RL poses challenges in coordinating the two policies given their mutual influence. We propose SUGARL, Sensorimotor Understanding Guided Active Reinforcement Learning, a framework that models motor and sensory policies separately, but jointly learns them using an intrinsic sensorimotor reward. This learnable reward, assigned by a sensorimotor reward module, incentivizes the sensory policy to select observations that are optimal for inferring its own motor action, inspired by the sensorimotor stage of humans. Through a series of experiments, we show the effectiveness of our method across a range of observability conditions and its adaptability to existing RL algorithms. The sensory policies learned through our method are observed to exhibit effective active vision strategies.
    Uncertainty Quantification via Neural Posterior Principal Components. (arXiv:2309.15533v2 [cs.CV] UPDATED)
    Uncertainty quantification is crucial for the deployment of image restoration models in safety-critical domains, like autonomous driving and biological imaging. To date, methods for uncertainty visualization have mainly focused on per-pixel estimates. Yet, a heatmap of per-pixel variances is typically of little practical use, as it does not capture the strong correlations between pixels. A more natural measure of uncertainty corresponds to the variances along the principal components (PCs) of the posterior distribution. Theoretically, the PCs can be computed by applying PCA on samples generated from a conditional generative model for the input image. However, this requires generating a very large number of samples at test time, which is painfully slow with the current state-of-the-art (diffusion) models. In this work, we present a method for predicting the PCs of the posterior distribution for any input image, in a single forward pass of a neural network. Our method can either wrap around a pre-trained model that was trained to minimize the mean square error (MSE), or can be trained from scratch to output both a predicted image and the posterior PCs. We showcase our method on multiple inverse problems in imaging, including denoising, inpainting, super-resolution, and biological image-to-image translation. Our method reliably conveys instance-adaptive uncertainty directions, achieving uncertainty quantification comparable with posterior samplers while being orders of magnitude faster. Code and examples are available at https://eliasnehme.github.io/NPPC/
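The sampling-based baseline described above (PCA applied to samples from a conditional generative model) can be sketched as follows. This is only an illustration of that baseline, not of the single-forward-pass method the abstract proposes; the posterior samples are assumed to be given as flattened lists of floats, and the top component is found with plain power iteration. The function name `top_principal_component` is hypothetical.

```python
import math


def top_principal_component(samples, iters=100):
    """Return (direction, variance) of the leading PC of the given samples."""
    n, d = len(samples), len(samples[0])
    mean = [sum(s[j] for s in samples) / n for j in range(d)]
    centered = [[s[j] - mean[j] for j in range(d)] for s in samples]
    # Unbiased sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Fixed, non-symmetric start vector; power iteration can stall if the
    # start is exactly orthogonal to the leading eigenvector.
    v = [1.0 / (j + 1) for j in range(d)]
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    # Rayleigh quotient: variance of the samples along direction v.
    variance = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d))
                   for i in range(d))
    return v, variance


# Toy "posterior samples" that vary only along the diagonal direction.
samples = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
direction, variance = top_principal_component(samples)
```

For images, `d` is the number of pixels and many samples are needed for a stable estimate, which is the test-time cost the abstract refers to.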
    DeepZero: Scaling up Zeroth-Order Optimization for Deep Model Training. (arXiv:2310.02025v2 [cs.LG] UPDATED)
    Zeroth-order (ZO) optimization has become a popular technique for solving machine learning (ML) problems when first-order (FO) information is difficult or impossible to obtain. However, the scalability of ZO optimization remains an open problem: Its use has primarily been limited to relatively small-scale ML problems, such as sample-wise adversarial attack generation. To the best of our knowledge, no prior work has demonstrated the effectiveness of ZO optimization in training deep neural networks (DNNs) without a significant decrease in performance. To overcome this roadblock, we develop DeepZero, a principled ZO deep learning (DL) framework that can scale ZO optimization to DNN training from scratch through three primary innovations. First, we demonstrate the advantages of coordinate-wise gradient estimation (CGE) over randomized vector-wise gradient estimation in training accuracy and computational efficiency. Second, we propose a sparsity-induced ZO training protocol that extends the model pruning methodology using only finite differences to explore and exploit the sparse DL prior in CGE. Third, we develop the methods of feature reuse and forward parallelization to advance the practical implementations of ZO training. Our extensive experiments show that DeepZero achieves state-of-the-art (SOTA) accuracy on ResNet-20 trained on CIFAR-10, approaching FO training performance for the first time. Furthermore, we show the practical utility of DeepZero in applications of certified adversarial defense and DL-based partial differential equation error correction, achieving 10-20% improvement over SOTA. We believe our results will inspire future research on scalable ZO optimization and contribute to advancing DL with black box.
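Coordinate-wise gradient estimation can be illustrated with a small sketch that uses only function evaluations. It assumes a forward finite difference with a smoothing parameter `mu` (both the exact difference form and the value of `mu` are assumptions, not taken from the abstract) and omits the sparsity, feature reuse, and parallelization components of DeepZero. The names `cge_gradient` and `zo_gradient_descent` are hypothetical.

```python
def cge_gradient(f, x, mu=1e-3):
    """Estimate the gradient of f at x one coordinate at a time.

    Uses d + 1 function evaluations for a d-dimensional input:
    g_i = (f(x + mu * e_i) - f(x)) / mu.
    """
    base = f(x)
    grad = []
    for i in range(len(x)):
        shifted = list(x)
        shifted[i] += mu
        grad.append((f(shifted) - base) / mu)
    return grad


def zo_gradient_descent(f, x0, lr=0.1, steps=200, mu=1e-3):
    """Plain gradient descent driven by the finite-difference estimate."""
    x = list(x0)
    for _ in range(steps):
        g = cge_gradient(f, x, mu)
        x = [xi - lr * gi for xi, gi in zip(x, g)]
    return x


def quadratic(x):
    """Toy objective with minimum at (1, -2)."""
    return (x[0] - 1.0) ** 2 + (x[1] + 2.0) ** 2


estimate = cge_gradient(quadratic, [0.0, 0.0])
solution = zo_gradient_descent(quadratic, [0.0, 0.0])
```

The cost of one estimate grows linearly with the number of parameters, which is why the abstract pairs CGE with sparsity and parallelization to make deep-network training feasible.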
    ArcheType: A Novel Framework for Open-Source Column Type Annotation using Large Language Models. (arXiv:2310.18208v2 [cs.CL] UPDATED)
    Existing deep-learning approaches to semantic column type annotation (CTA) have important shortcomings: they rely on semantic types which are fixed at training time; require a large number of training samples per type and incur large run-time inference costs; and their performance can degrade when evaluated on novel datasets, even when types remain constant. Large language models have exhibited strong zero-shot classification performance on a wide range of tasks and in this paper we explore their use for CTA. We introduce ArcheType, a simple, practical method for context sampling, prompt serialization, model querying, and label remapping, which enables large language models to solve CTA problems in a fully zero-shot manner. We ablate each component of our method separately, and establish that improvements to context sampling and label remapping provide the most consistent gains. ArcheType establishes a new state-of-the-art performance on zero-shot CTA benchmarks (including three new domain-specific benchmarks which we release along with this paper), and when used in conjunction with classical CTA techniques, it outperforms a SOTA DoDuo model on the fine-tuned SOTAB benchmark. Our code is available at https://github.com/penfever/ArcheType.
    A Novel Framework for Policy Mirror Descent with General Parameterization and Linear Convergence. (arXiv:2301.13139v3 [stat.ML] UPDATED)
    Modern policy optimization methods in reinforcement learning, such as TRPO and PPO, owe their success to the use of parameterized policies. However, while theoretical guarantees have been established for this class of algorithms, especially in the tabular setting, the use of general parameterization schemes remains mostly unjustified. In this work, we introduce a novel framework for policy optimization based on mirror descent that naturally accommodates general parameterizations. The policy class induced by our scheme recovers known classes, e.g., softmax, and generates new ones depending on the choice of mirror map. Using our framework, we obtain the first result that guarantees linear convergence for a policy-gradient-based method involving general parameterization. To demonstrate the ability of our framework to accommodate general parameterization schemes, we provide its sample complexity when using shallow neural networks, show that it represents an improvement upon the previous best results, and empirically validate the effectiveness of our theoretical claims on classic control tasks.
    RECKONING: Reasoning through Dynamic Knowledge Encoding. (arXiv:2305.06349v3 [cs.CL] UPDATED)
    Recent studies on transformer-based language models show that they can answer questions by reasoning over knowledge provided as part of the context (i.e., in-context reasoning). However, since the available knowledge is often not filtered for a particular question, in-context reasoning can be sensitive to distractor facts, additional content that is irrelevant to a question but that may be relevant for a different question (i.e., not necessarily random noise). In these situations, the model fails to distinguish the knowledge that is necessary to answer the question, leading to spurious reasoning and degraded performance. This reasoning failure contrasts with the model's apparent ability to distinguish its contextual knowledge from all the knowledge it has memorized during pre-training. Following this observation, we propose teaching the model to reason more robustly by folding the provided contextual knowledge into the model's parameters before presenting it with a question. Our method, RECKONING, is a bi-level learning algorithm that teaches language models to reason by updating their parametric knowledge through back-propagation, allowing them to then answer questions using the updated parameters. During training, the inner loop rapidly adapts a copy of the model weights to encode contextual knowledge into its parameters. In the outer loop, the model learns to use the updated weights to reproduce and answer reasoning questions about the memorized knowledge. Our experiments on two multi-hop reasoning datasets show that RECKONING's performance improves over the in-context reasoning baseline (by up to 4.5%). We also find that compared to in-context reasoning, RECKONING generalizes better to longer reasoning chains unseen during training, is more robust to distractors in the context, and is more computationally efficient when multiple questions are asked about the same knowledge.
    Disentangled Representation Learning with Large Language Models for Text-Attributed Graphs. (arXiv:2310.18152v2 [cs.CL] UPDATED)
    Text-attributed graphs (TAGs) are prevalent on the web and research over TAGs such as citation networks, e-commerce networks and social networks has attracted considerable attention in the web community. Recently, large language models (LLMs) have demonstrated exceptional capabilities across a wide range of tasks. However, the existing works focus on harnessing the potential of LLMs solely relying on prompts to convey graph structure information to LLMs, thus suffering from insufficient understanding of the complex structural relationships within TAGs. To address this problem, in this paper we present the Disentangled Graph-Text Learner (DGTL) model, which is able to enhance the reasoning and predicting capabilities of LLMs for TAGs. Our proposed DGTL model incorporates graph structure information through tailored disentangled graph neural network (GNN) layers, enabling LLMs to capture the intricate relationships hidden in text-attributed graphs from multiple structural factors. Furthermore, DGTL operates with frozen pre-trained LLMs, reducing computational costs and allowing much more flexibility in combining with different LLM models. Experimental evaluations demonstrate the effectiveness of the proposed DGTL model on achieving superior or comparable performance over state-of-the-art baselines. Additionally, we also demonstrate that our DGTL model can offer natural language explanations for predictions, thereby significantly enhancing model interpretability.
    Crop Disease Classification using Support Vector Machines with Green Chromatic Coordinate (GCC) and Attention based feature extraction for IoT based Smart Agricultural Applications. (arXiv:2311.00429v2 [eess.IV] UPDATED)
    Crops hold paramount significance as they serve as the primary provider of energy, nutrition, and medicinal benefits for the human population. Plant diseases, however, can negatively affect leaves during agricultural cultivation, resulting in significant losses in crop output and economic value. Therefore, it is crucial for farmers to identify crop diseases. However, manual identification frequently necessitates hard work, a lot of planning, and in-depth familiarity with plant pathogens. Given these numerous obstacles, it is essential to provide solutions that can easily interface with mobile and IoT devices so that farmers can guarantee the best possible crop development. Various machine learning (ML) and deep learning (DL) algorithms have been created and studied for plant disease detection, yielding substantial and promising results. This article presents a novel classification method that builds on prior work by utilising attention-based feature extraction, RGB channel-based chromatic analysis, Support Vector Machines (SVM) for improved performance, and the ability to integrate with mobile applications and IoT devices after quantization of information. Several disease classification algorithms were compared with the suggested model. In terms of accuracy, Vision Transformer-based feature extraction combined with an additional Green Chromatic Coordinate feature and SVM classification (GCCViT-SVM) achieved 99.69%, whereas the version quantized for IoT device integration achieved 97.41% while being almost 4x smaller. Our findings have profound implications because they have the potential to transform how farmers identify crop illnesses with precise and fast information, thereby preserving agricultural output and ensuring food security.
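    The Green Chromatic Coordinate is commonly defined per pixel as G / (R + G + B). The sketch below assumes that standard definition; the abstract does not state the exact formula or aggregation the authors use, and the function names and the pixel representation (a list of RGB tuples) are illustrative only.

```python
def green_chromatic_coordinate(pixels):
    """Return the GCC value G / (R + G + B) for each (R, G, B) pixel.

    Assumes the common definition of GCC; pixels with R + G + B == 0
    (pure black) are mapped to 0.0 to avoid division by zero.
    """
    values = []
    for r, g, b in pixels:
        total = r + g + b
        values.append(g / total if total else 0.0)
    return values


def mean_gcc(pixels):
    """Average GCC over an image, usable as one extra scalar feature."""
    values = green_chromatic_coordinate(pixels)
    return sum(values) / len(values) if values else 0.0
```

    A scalar such as `mean_gcc(image_pixels)` could then be appended to an extracted feature vector before it is passed to a classifier; how the paper actually combines the two is not described in the abstract.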
    Bandit Social Learning: Exploration under Myopic Behavior. (arXiv:2302.07425v4 [cs.GT] UPDATED)
    We study social learning dynamics motivated by reviews on online platforms. The agents collectively follow a simple multi-armed bandit protocol, but each agent acts myopically, without regard to exploration. We allow a wide range of myopic behaviors that are consistent with (parameterized) confidence intervals for the arms' expected rewards. We derive stark learning failures for any such behavior, and provide matching positive results. As a special case, we obtain the first general results on failure of the greedy algorithm in bandits, thus providing a theoretical foundation for why bandit algorithms should explore.
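    A toy simulation can make the greedy failure mode concrete. This is a minimal sketch under assumptions that are not from the paper: two Bernoulli arms, one initial pull per arm, and ties broken toward the lowest index. A "failure" here means the better arm is pulled once, looks bad, and is never tried again.

```python
import random


def run_greedy(means, horizon, rng):
    """Play a Bernoulli bandit greedily; return the pull count per arm.

    Each arm is pulled once, then the arm with the highest empirical
    mean is always chosen (ties go to the lowest index).
    """
    pulls = [0] * len(means)
    wins = [0] * len(means)

    def pull(arm):
        pulls[arm] += 1
        wins[arm] += 1 if rng.random() < means[arm] else 0

    for arm in range(len(means)):
        pull(arm)
    for _ in range(horizon - len(means)):
        best = max(range(len(means)), key=lambda a: wins[a] / pulls[a])
        pull(best)
    return pulls


def failure_rate(means, horizon=200, runs=2000, seed=0):
    """Fraction of runs in which the best arm is pulled only once."""
    rng = random.Random(seed)
    best_arm = max(range(len(means)), key=lambda a: means[a])
    failures = sum(
        run_greedy(means, horizon, rng)[best_arm] == 1 for _ in range(runs)
    )
    return failures / runs
```

    With `means=[0.7, 0.5]`, the better arm is abandoned whenever its first pull fails and the other arm's first pull succeeds, which happens with probability 0.3 x 0.5 = 0.15 regardless of the horizon.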
    Is BERT Blind? Exploring the Effect of Vision-and-Language Pretraining on Visual Language Understanding. (arXiv:2303.12513v2 [cs.CV] UPDATED)
    Most humans use visual imagination to understand and reason about language, but models such as BERT reason about language using knowledge acquired during text-only pretraining. In this work, we investigate whether vision-and-language pretraining can improve performance on text-only tasks that involve implicit visual reasoning, focusing primarily on zero-shot probing methods. We propose a suite of visual language understanding (VLU) tasks for probing the visual reasoning abilities of text encoder models, as well as various non-visual natural language understanding (NLU) tasks for comparison. We also contribute a novel zero-shot knowledge probing method, Stroop probing, for applying models such as CLIP to text-only tasks without needing a prediction head such as the masked language modelling head of models like BERT. We show that SOTA multimodally trained text encoders outperform unimodally trained text encoders on the VLU tasks while being underperformed by them on the NLU tasks, lending new context to previously mixed results regarding the NLU capabilities of multimodal models. We conclude that exposure to images during pretraining affords inherent visual reasoning knowledge that is reflected in language-only tasks that require implicit visual reasoning. Our findings bear importance in the broader context of multimodal learning, providing principled guidelines for the choice of text encoders used in such contexts.
    An $\varepsilon$-Best-Arm Identification Algorithm for Fixed-Confidence and Beyond. (arXiv:2305.16041v2 [stat.ML] UPDATED)
    We propose EB-TC$\varepsilon$, a novel sampling rule for $\varepsilon$-best arm identification in stochastic bandits. It is the first instance of a Top Two algorithm analyzed for approximate best arm identification. EB-TC$\varepsilon$ is an *anytime* sampling rule that can therefore be employed without modification for fixed confidence or fixed budget identification (without prior knowledge of the budget). We provide three types of theoretical guarantees for EB-TC$\varepsilon$. First, we prove bounds on its expected sample complexity in the fixed confidence setting, notably showing its asymptotic optimality in combination with an adaptive tuning of its exploration parameter. We complement these findings with upper bounds on its probability of error at any time and for any error parameter, which further yield upper bounds on its simple regret at any time. Finally, we show through numerical simulations that EB-TC$\varepsilon$ performs favorably compared to existing algorithms, in different settings.
    NashFormer: Leveraging Local Nash Equilibria for Semantically Diverse Trajectory Prediction. (arXiv:2305.17600v2 [cs.LG] UPDATED)
    Interactions between road agents present a significant challenge in trajectory prediction, especially in cases involving multiple agents. Because existing diversity-aware predictors do not account for the interactive nature of multi-agent predictions, they may miss these important interaction outcomes. In this paper, we propose NashFormer, a framework for trajectory prediction that leverages game-theoretic inverse reinforcement learning to improve coverage of multi-modal predictions. We use a training-time game-theoretic analysis as an auxiliary loss resulting in improved coverage and accuracy without presuming a taxonomy of actions for the agents. We demonstrate our approach on the interactive split of the Waymo Open Motion Dataset, including four subsets involving scenarios with high interaction complexity. Experiment results show that our predictor produces accurate predictions while covering $33\%$ more potential interactions versus a baseline model.
    A physics-informed and attention-based graph learning approach for regional electric vehicle charging demand prediction. (arXiv:2309.05259v2 [cs.LG] UPDATED)
    Along with the proliferation of electric vehicles (EVs), optimizing the use of EV charging space can significantly alleviate the growing load on intelligent transportation systems. As the foundation to achieve such an optimization, a spatiotemporal method for EV charging demand prediction in urban areas is required. Although several solutions based on data-driven deep learning methods have been proposed, these performance-oriented methods may fail to correctly capture the inverse relationship between charging demands and prices. To tackle the emerging challenges of training an accurate and interpretable prediction model, this paper proposes a novel approach that enables the integration of graph and temporal attention mechanisms for feature extraction and the use of physics-informed meta-learning in the model pre-training step for knowledge transfer. Evaluation results on a dataset of 18,013 EV charging piles in Shenzhen, China, show that the proposed approach, named PAG, can achieve state-of-the-art forecasting performance and the ability to understand the adaptive changes in charging demands caused by price fluctuations.
    Exact Generalization Guarantees for (Regularized) Wasserstein Distributionally Robust Models. (arXiv:2305.17076v2 [cs.LG] UPDATED)
    Wasserstein distributionally robust estimators have emerged as powerful models for prediction and decision-making under uncertainty. These estimators provide attractive generalization guarantees: the robust objective obtained from the training distribution is an exact upper bound on the true risk with high probability. However, existing guarantees either suffer from the curse of dimensionality, are restricted to specific settings, or lead to spurious error terms. In this paper, we show that these generalization guarantees actually hold on general classes of models, do not suffer from the curse of dimensionality, and can even cover distribution shifts at testing. We also prove that these results carry over to the newly-introduced regularized versions of Wasserstein distributionally robust problems.
    Sparse Modular Activation for Efficient Sequence Modeling. (arXiv:2306.11197v4 [cs.LG] UPDATED)
    Recent hybrid models combining Linear State Space Models (SSMs) with self-attention mechanisms have demonstrated impressive results across a range of sequence modeling tasks. However, current approaches apply attention modules statically and uniformly to all elements in the input sequences, leading to sub-optimal quality-efficiency trade-offs. To address this limitation, we introduce Sparse Modular Activation (SMA), a general mechanism enabling neural networks to sparsely and dynamically activate sub-modules for sequence elements in a differentiable manner. Through allowing each element to skip non-activated sub-modules, SMA reduces computation and memory consumption of neural networks at both training and inference stages. To validate the effectiveness of SMA on sequence modeling, we design a novel neural architecture, SeqBoat, which employs SMA to sparsely activate a Gated Attention Unit (GAU) based on the state representations learned from an SSM. By constraining the GAU to only conduct local attention on the activated inputs, SeqBoat can achieve linear inference complexity with theoretically infinite attention span, and provide substantially better quality-efficiency trade-off than the chunking-based models. With experiments on a wide range of tasks, including long sequence modeling, speech classification and language modeling, SeqBoat brings new state-of-the-art results among hybrid models with linear complexity, and reveals the amount of attention needed for each task through the learned sparse activation patterns. Our code is publicly available at https://github.com/renll/SeqBoat.
    Optimizing Retrieval-augmented Reader Models via Token Elimination. (arXiv:2310.13682v2 [cs.CL] UPDATED)
    Fusion-in-Decoder (FiD) is an effective retrieval-augmented language model applied across a variety of open-domain tasks, such as question answering, fact checking, etc. In FiD, supporting passages are first retrieved and then processed using a generative model (Reader), which can cause a significant bottleneck in decoding time, particularly with long outputs. In this work, we analyze the contribution and necessity of all the retrieved passages to the performance of reader models, and propose eliminating some of the retrieved information, at the token level, that might not contribute essential information to the answer generation process. We demonstrate that our method can reduce run-time by up to 62.2%, with only a 2% reduction in performance, and in some cases, even improve the performance results.
    NODE-ImgNet: a PDE-informed effective and robust model for image denoising. (arXiv:2305.11049v2 [eess.IV] UPDATED)
    Inspired by the traditional partial differential equation (PDE) approach for image denoising, we propose a novel neural network architecture, referred to as NODE-ImgNet, that combines neural ordinary differential equations (NODEs) with convolutional neural network (CNN) blocks. NODE-ImgNet is intrinsically a PDE model, where the dynamic system is learned implicitly without the explicit specification of the PDE. This naturally circumvents the typical issues associated with introducing artifacts during the learning process. By invoking such a NODE structure, which can also be viewed as a continuous variant of a residual network (ResNet) and inherits its advantage in image denoising, our model achieves enhanced accuracy and parameter efficiency. In particular, our model exhibits consistent effectiveness in different scenarios, including denoising gray and color images perturbed by Gaussian noise, as well as real-noisy images, and demonstrates superiority in learning from small image datasets.
    ProtoryNet - Interpretable Text Classification Via Prototype Trajectories. (arXiv:2007.01777v5 [cs.LG] UPDATED)
    We propose a novel interpretable deep neural network for text classification, called ProtoryNet, based on a new concept of prototype trajectories. Motivated by the prototype theory in modern linguistics, ProtoryNet makes a prediction by finding the most similar prototype for each sentence in a text sequence and feeding an RNN backbone with the proximity of each sentence to the corresponding active prototype. The RNN backbone then captures the temporal pattern of the prototypes, which we refer to as prototype trajectories. Prototype trajectories enable intuitive and fine-grained interpretation of the reasoning process of the RNN model, in resemblance to how humans analyze texts. We also design a prototype pruning procedure to reduce the total number of prototypes used by the model for better interpretability. Experiments on multiple public data sets show that ProtoryNet is more accurate than the baseline prototype-based deep neural net and reduces the performance gap compared to state-of-the-art black-box models. In addition, after prototype pruning, the resulting ProtoryNet models only need less than or around 20 prototypes for all datasets, which significantly benefits interpretability. Furthermore, we report a survey result indicating that human users find ProtoryNet more intuitive and easier to understand than other prototype-based methods.
    Detecting hidden confounding in observational data using multiple environments. (arXiv:2205.13935v4 [stat.ME] UPDATED)
    A common assumption in causal inference from observational data is that there is no hidden confounding. Yet it is, in general, impossible to verify this assumption from a single dataset. Under the assumption of independent causal mechanisms underlying the data-generating process, we demonstrate a way to detect unobserved confounders when having multiple observational datasets coming from different environments. We present a theory for testable conditional independencies that are only absent when there is hidden confounding and examine cases where we violate its assumptions: degenerate & dependent mechanisms, and faithfulness violations. Additionally, we propose a procedure to test these independencies and study its empirical finite-sample behavior using simulation studies and semi-synthetic data based on a real-world dataset. In most cases, the proposed procedure correctly predicts the presence of hidden confounding, particularly when the confounding bias is large.
    Mutual Information Regularized Offline Reinforcement Learning. (arXiv:2210.07484v2 [cs.LG] UPDATED)
    The major challenge of offline RL is the distribution shift that appears when out-of-distribution actions are queried, which makes the policy improvement direction biased by extrapolation errors. Most existing methods address this problem by penalizing the policy or value for deviating from the behavior policy during policy improvement or evaluation. In this work, we propose a novel MISA framework to approach offline RL from the perspective of Mutual Information between States and Actions in the dataset by directly constraining the policy improvement direction. MISA constructs lower bounds of mutual information parameterized by the policy and Q-values. We show that optimizing this lower bound is equivalent to maximizing the likelihood of a one-step improved policy on the offline dataset. Hence, we constrain the policy improvement direction to lie in the data manifold. The resulting algorithm simultaneously augments the policy evaluation and improvement by adding mutual information regularizations. MISA is a general framework that unifies conservative Q-learning (CQL) and behavior regularization methods (e.g., TD3+BC) as special cases. We introduce 3 different variants of MISA, and empirically demonstrate that a tighter mutual information lower bound gives better offline RL performance. In addition, our extensive experiments show MISA significantly outperforms a wide range of baselines on various tasks of the D4RL benchmark, e.g., achieving 742.9 total points on gym-locomotion tasks. Our code is available at https://github.com/sail-sg/MISA.
    Are you using test log-likelihood correctly?. (arXiv:2212.00219v3 [stat.ML] UPDATED)
    Test log-likelihood is commonly used to compare different models of the same data or different approximate inference algorithms for fitting the same probabilistic model. We present simple examples demonstrating how comparisons based on test log-likelihood can contradict comparisons according to other objectives. Specifically, our examples show that (i) approximate Bayesian inference algorithms that attain higher test log-likelihoods need not also yield more accurate posterior approximations and (ii) conclusions about forecast accuracy based on test log-likelihood comparisons may not agree with conclusions based on root mean squared error.
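    Point (ii) can be illustrated with a hand-built example. The numbers below are invented for illustration and are not from the paper: one Gaussian forecaster has small errors but is badly overconfident, the other has larger errors with a well-matched variance. The first wins on root mean squared error, the second on test log-likelihood.

```python
import math


def gaussian_log_likelihood(y_true, means, sigma):
    """Average log-density of y_true under N(mean_i, sigma^2) predictions."""
    total = 0.0
    for y, m in zip(y_true, means):
        total += (
            -0.5 * math.log(2 * math.pi)
            - math.log(sigma)
            - 0.5 * ((y - m) / sigma) ** 2
        )
    return total / len(y_true)


def rmse(y_true, means):
    """Root mean squared error of the predictive means."""
    return math.sqrt(sum((y - m) ** 2 for y, m in zip(y_true, means)) / len(y_true))


# Hypothetical held-out targets and two hypothetical forecasters.
y = [0.0, 1.0, -1.0, 0.5, -0.5]
sharp_means, sharp_sigma = [v + 0.1 for v in y], 0.01  # accurate, overconfident
vague_means, vague_sigma = [v + 0.5 for v in y], 0.5   # less accurate, calibrated
```

    Comparing `rmse(y, sharp_means)` with `rmse(y, vague_means)` favors the sharp forecaster, while comparing the two `gaussian_log_likelihood` values favors the vague one, so the two criteria rank the models in opposite order.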
    Training Matters: Unlocking Potentials of Deeper Graph Convolutional Neural Networks. (arXiv:2008.08838v3 [cs.LG] UPDATED)
    The performance limit of Graph Convolutional Networks (GCNs) and the fact that we cannot stack more of them to increase the performance, which we usually do for other deep learning paradigms, are pervasively thought to be caused by the limitations of the GCN layers, including insufficient expressive power, etc. However, if so, for a fixed architecture, it would be unlikely to lower the training difficulty and to improve performance by changing only the training procedure, which we show in this paper is not only possible but possible in several ways. This paper first identifies the training difficulty of GCNs from the perspective of graph signal energy loss. More specifically, we find that the loss of energy in the backward pass during training nullifies the learning of the layers closer to the input. Then, we propose several methodologies to mitigate the training problem by slightly modifying the GCN operator, from the energy perspective. After empirical validation, we confirm that these changes of operator lead to a significant decrease in the training difficulties and a notable performance boost, without changing the composition of parameters. With these, we conclude that the root cause of the problem is more likely the training difficulty than the others.
    Monotone Learning. (arXiv:2202.05246v3 [cs.LG] UPDATED)
    The amount of training-data is one of the key factors which determines the generalization capacity of learning algorithms. Intuitively, one expects the error rate to decrease as the amount of training-data increases. Perhaps surprisingly, natural attempts to formalize this intuition give rise to interesting and challenging mathematical questions. For example, in their classical book on pattern recognition, Devroye, Gyorfi, and Lugosi (1996) ask whether there exists a monotone Bayes-consistent algorithm. This question remained open for over 25 years, until recently Pestov (2021) resolved it for binary classification, using an intricate construction of a monotone Bayes-consistent algorithm. We derive a general result in multiclass classification, showing that every learning algorithm A can be transformed to a monotone one with similar performance. Further, the transformation is efficient and only uses a black-box oracle access to A. This demonstrates that one can provably avoid non-monotonic behaviour without compromising performance, thus answering questions asked by Devroye et al (1996), Viering, Mey, and Loog (2019), Viering and Loog (2021), and by Mhammedi (2021). Our transformation readily implies monotone learners in a variety of contexts: for example it extends Pestov's result to classification tasks with an arbitrary number of labels. This is in contrast with Pestov's work which is tailored to binary classification. In addition, we provide uniform bounds on the error of the monotone algorithm. This makes our transformation applicable in distribution-free settings. For example, in PAC learning it implies that every learnable class admits a monotone PAC learner. This resolves questions by Viering, Mey, and Loog (2019); Viering and Loog (2021); Mhammedi (2021).
    A Contrastive Approach to Online Change Point Detection. (arXiv:2206.10143v3 [stat.ML] UPDATED)
    We suggest a novel procedure for online change point detection. Our approach expands an idea of maximizing a discrepancy measure between points from pre-change and post-change distributions. This leads to a flexible procedure suitable for both parametric and nonparametric scenarios. We prove non-asymptotic bounds on the average running length of the procedure and its expected detection delay. The efficiency of the algorithm is illustrated with numerical experiments on synthetic and real-world data sets.
    Multi-scale data reconstruction of turbulent rotating flows with Gappy POD, Extended POD and Generative Adversarial Networks. (arXiv:2210.11921v2 [physics.flu-dyn] UPDATED)
    Data reconstruction of rotating turbulent snapshots is investigated utilizing data-driven tools. This problem is crucial for numerous geophysical applications and fundamental aspects, given the concurrent effects of direct and inverse energy cascades, which lead to non-Gaussian statistics at both large and small scales. Data assimilation also serves as a tool to rank physical features within turbulence, by evaluating the performance of reconstruction in terms of the quality and quantity of the information used. Additionally, benchmarking various reconstruction techniques is essential to assess the trade-off between quantitative supremacy, implementation complexity, and explicability. In this study, we use linear and non-linear tools based on the Proper Orthogonal Decomposition (POD) and Generative Adversarial Network (GAN) for reconstructing rotating turbulence snapshots with spatial damages (inpainting). We focus on accurately reproducing both statistical properties and instantaneous velocity fields. Different gap sizes and gap geometries are investigated in order to assess the importance of coherency and multi-scale properties of the missing information. Surprisingly enough, concerning point-wise reconstruction, the non-linear GAN does not outperform one of the linear POD techniques. On the other hand, supremacy of the GAN approach is shown when the statistical multi-scale properties are compared. Similarly, extreme events in the gap region are better predicted when using GAN. The balance between point-wise error and statistical properties is controlled by the adversarial ratio, which determines the relative importance of the generator and the discriminator in the GAN training. Robustness against the measurement noise is also discussed.
    When Do We Need Graph Neural Networks for Node Classification?. (arXiv:2210.16979v2 [cs.LG] UPDATED)
    Graph Neural Networks (GNNs) extend basic Neural Networks (NNs) by additionally making use of graph structure based on the relational inductive bias (edge bias), rather than treating the nodes as collections of independent and identically distributed (i.i.d.) samples. Though GNNs are believed to outperform basic NNs in real-world tasks, it is found that in some cases, GNNs have little performance gain or even underperform graph-agnostic NNs. To identify these cases, based on graph signal processing and statistical hypothesis testing, we propose two measures which analyze the cases in which the edge bias in features and labels does not provide advantages. Based on the measures, a threshold value can be given to predict the potential performance advantages of graph-aware models over graph-agnostic models.
    Nearly Minimax Optimal Reinforcement Learning for Linear Markov Decision Processes. (arXiv:2212.06132v3 [cs.LG] UPDATED)
    We study reinforcement learning (RL) with linear function approximation. For episodic time-inhomogeneous linear Markov decision processes (linear MDPs) whose transition probability can be parameterized as a linear function of a given feature mapping, we propose the first computationally efficient algorithm that achieves the nearly minimax optimal regret $\tilde O(d\sqrt{H^3K})$, where $d$ is the dimension of the feature mapping, $H$ is the planning horizon, and $K$ is the number of episodes. Our algorithm is based on a weighted linear regression scheme with a carefully designed weight, which depends on a new variance estimator that (1) directly estimates the variance of the optimal value function, (2) monotonically decreases with respect to the number of episodes to ensure a better estimation accuracy, and (3) uses a rare-switching policy to update the value function estimator to control the complexity of the estimated value function class. Our work provides a complete answer to optimal RL with linear MDPs, and the developed algorithm and theoretical tools may be of independent interest.
    Comparative Knowledge Distillation. (arXiv:2311.02253v1 [cs.LG])
    In the era of large scale pretrained models, Knowledge Distillation (KD) serves an important role in transferring the wisdom of computationally heavy teacher models to lightweight, efficient student models while preserving performance. Traditional KD paradigms, however, assume readily available access to teacher models for frequent inference -- a notion increasingly at odds with the realities of costly, often proprietary, large scale models. Addressing this gap, our paper considers how to minimize the dependency on teacher model inferences in KD in a setting we term Few Teacher Inference Knowledge Distillation (FTI KD). We observe that prevalent KD techniques and state of the art data augmentation strategies fall short in this constrained setting. Drawing inspiration from educational principles that emphasize learning through comparison, we propose Comparative Knowledge Distillation (CKD), which encourages student models to understand the nuanced differences in a teacher model's interpretations of samples. Critically, CKD provides additional learning signals to the student without making additional teacher calls. We also extend the principle of CKD to groups of samples, enabling even more efficient learning from limited teacher calls. Empirical evaluation across varied experimental settings indicates that CKD consistently outperforms state of the art data augmentation and KD techniques.
    Approximating CKY with Transformers. (arXiv:2305.02386v2 [cs.CL] UPDATED)
    We investigate the ability of transformer models to approximate the CKY algorithm, using them to directly predict a sentence's parse and thus avoid the CKY algorithm's cubic dependence on sentence length. We find that on standard constituency parsing benchmarks this approach achieves competitive or better performance than comparable parsers that make use of CKY, while being faster. We also evaluate the viability of this approach for parsing under random PCFGs. Here we find that performance declines as the grammar becomes more ambiguous, suggesting that the transformer is not fully capturing the CKY computation. However, we also find that incorporating additional inductive bias is helpful, and we propose a novel approach that makes use of gradients with respect to chart representations in predicting the parse, in analogy with the CKY algorithm being a subgradient of a partition function variant with respect to the chart.
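    For reference, the cubic cost mentioned above comes from CKY's three nested loops over span length, span start, and split point. The following is a minimal textbook recognizer for a grammar in Chomsky normal form, not the paper's parser; the rule-table layout and the function name are assumptions made for this sketch.

```python
def cky_recognize(words, lexical_rules, binary_rules, start="S"):
    """Return True if `words` is derivable from `start` under a CNF grammar.

    lexical_rules: dict mapping a word to the set of nonterminals A with A -> word.
    binary_rules: dict mapping (B, C) to the set of nonterminals A with A -> B C.
    """
    n = len(words)
    if n == 0:
        return False
    # chart[i][j] holds the nonterminals that span words[i:j].
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, word in enumerate(words):
        chart[i][i + 1] = set(lexical_rules.get(word, ()))
    for width in range(2, n + 1):          # span length
        for i in range(n - width + 1):     # span start
            j = i + width
            for k in range(i + 1, j):      # split point
                for left in chart[i][k]:
                    for right in chart[k][j]:
                        chart[i][j] |= binary_rules.get((left, right), set())
    return start in chart[0][n]
```

    The three index loops give O(n^3) chart updates for a sentence of length n (times a grammar-dependent factor), which is the dependence a direct transformer prediction would sidestep.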
    Recommender Systems with Generative Retrieval. (arXiv:2305.05065v3 [cs.IR] UPDATED)
    Modern recommender systems perform large-scale retrieval by first embedding queries and item candidates in the same unified space, followed by approximate nearest neighbor search to select top candidates given a query embedding. In this paper, we propose a novel generative retrieval approach, where the retrieval model autoregressively decodes the identifiers of the target candidates. To that end, we create a semantically meaningful tuple of codewords to serve as a Semantic ID for each item. Given Semantic IDs for items in a user session, a Transformer-based sequence-to-sequence model is trained to predict the Semantic ID of the next item that the user will interact with. To the best of our knowledge, this is the first Semantic ID-based generative model for recommendation tasks. We show that recommender systems trained with the proposed paradigm significantly outperform the current SOTA models on various datasets. In addition, we show that incorporating Semantic IDs into the sequence-to-sequence model enhances its ability to generalize, as evidenced by the improved retrieval performance observed for items with no prior interaction history.
    Regularized Linear Regression for Binary Classification. (arXiv:2311.02270v1 [cs.LG])
    Regularized linear regression is a promising approach for binary classification problems in which the training set has noisy labels since the regularization term can help to avoid interpolating the mislabeled data points. In this paper we provide a systematic study of the effects of the regularization strength on the performance of linear classifiers that are trained to solve binary classification problems by minimizing a regularized least-squares objective. We consider the over-parametrized regime and assume that the classes are generated from a Gaussian Mixture Model (GMM) where a fraction $c<\frac{1}{2}$ of the training data is mislabeled. Under these assumptions, we rigorously analyze the classification errors resulting from the application of ridge, $\ell_1$, and $\ell_\infty$ regression. In particular, we demonstrate that ridge regression invariably improves the classification error. We prove that $\ell_1$ regularization induces sparsity and observe that in many cases one can sparsify the solution by up to two orders of magnitude without any considerable loss of performance, even though the GMM has no underlying sparsity structure. For $\ell_\infty$ regularization we show that, for large enough regularization strength, the optimal weights concentrate around two values of opposite sign. We observe that in many cases the corresponding "compression" of each weight to a single bit leads to very little loss in performance. These latter observations can have significant practical ramifications.
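    The regularized least-squares objective can be sketched on a toy problem. Note the assumptions: this example is two-dimensional rather than over-parametrized, the mixture means, sample size, and regularization strength are arbitrary choices, and all function names are hypothetical. It only shows the mechanics of fitting ridge weights on noisy labels and classifying by sign, not the paper's analysis.

```python
import random


def ridge_fit_2d(points, labels, lam):
    """Solve min_w sum_i (w . x_i - y_i)^2 + lam * ||w||^2 for 2-D inputs.

    Uses the closed form w = (X^T X + lam I)^{-1} X^T y with an explicit
    2x2 inverse, so no linear-algebra library is needed.
    """
    a = sum(x0 * x0 for x0, _ in points) + lam
    b = sum(x0 * x1 for x0, x1 in points)
    d = sum(x1 * x1 for _, x1 in points) + lam
    r0 = sum(x0 * y for (x0, _), y in zip(points, labels))
    r1 = sum(x1 * y for (_, x1), y in zip(points, labels))
    det = a * d - b * b
    return ((d * r0 - b * r1) / det, (a * r1 - b * r0) / det)


def noisy_gmm(n, flip_fraction, rng, mean=(2.0, 2.0)):
    """Sample a two-class Gaussian mixture; return points, clean and noisy labels."""
    points, clean, noisy = [], [], []
    for _ in range(n):
        y = rng.choice((-1, 1))
        points.append((y * mean[0] + rng.gauss(0, 1), y * mean[1] + rng.gauss(0, 1)))
        clean.append(y)
        # Each label is flipped independently with probability flip_fraction.
        noisy.append(-y if rng.random() < flip_fraction else y)
    return points, clean, noisy


def accuracy(w, points, labels):
    """Fraction of points whose sign(w . x) matches the label."""
    hits = sum(
        (w[0] * x0 + w[1] * x1 > 0) == (y > 0)
        for (x0, x1), y in zip(points, labels)
    )
    return hits / len(points)
```

    A typical use is to call `noisy_gmm(400, 0.2, random.Random(0))`, fit `ridge_fit_2d` on the noisy labels, and then measure `accuracy` against the clean labels to see how well the classifier recovers the true classes despite the mislabeled points.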
    Detection and Localization of Melanoma Skin Cancer in Histopathological Whole Slide Images. (arXiv:2302.03014v4 [eess.IV] UPDATED)
    Melanoma diagnosed and treated in its early stages can increase the survival rate. A projected increase in skin cancer incidence and a dearth of dermatopathologists have emphasized the need for computational pathology (CPATH) systems. CPATH systems with deep learning (DL) models have the potential to identify the presence of melanoma by exploiting underlying morphological and cellular features. This paper proposes a DL method to detect melanoma and distinguish between normal skin and benign/malignant melanocytic lesions in Whole Slide Images (WSI). Our method detects lesions with high accuracy and localizes them on a WSI to identify potential regions of interest for pathologists. Interestingly, our DL method relies on a single CNN that first creates localization maps and then uses them to perform slide-level predictions to determine which patients have melanoma. Our best model provides favorable patch-wise classification results with a 0.992 F1 score and 0.99 sensitivity on unseen data. The source code is available at https://github.com/RogerAmundsen/Melanoma-Diagnosis-and-Localization-from-Whole-Slide-Images-using-Convolutional-Neural-Networks.
    Stochastic Interpolants: A Unifying Framework for Flows and Diffusions. (arXiv:2303.08797v3 [cs.LG] UPDATED)
    A class of generative models that unifies flow-based and diffusion-based methods is introduced. These models extend the framework proposed in Albergo & Vanden-Eijnden (2023), enabling the use of a broad class of continuous-time stochastic processes called 'stochastic interpolants' to bridge any two arbitrary probability density functions exactly in finite time. These interpolants are built by combining data from the two prescribed densities with an additional latent variable that shapes the bridge in a flexible way. The time-dependent probability density function of the stochastic interpolant is shown to satisfy a first-order transport equation as well as a family of forward and backward Fokker-Planck equations with tunable diffusion coefficient. Upon consideration of the time evolution of an individual sample, this viewpoint immediately leads to both deterministic and stochastic generative models based on probability flow equations or stochastic differential equations with an adjustable level of noise. The drift coefficients entering these models are time-dependent velocity fields characterized as the unique minimizers of simple quadratic objective functions, one of which is a new objective for the score of the interpolant density. We show that minimization of these quadratic objectives leads to control of the likelihood for generative models built upon stochastic dynamics, while likelihood control for deterministic dynamics is more stringent. We also discuss connections with other methods such as score-based diffusion models, stochastic localization processes, probabilistic denoising techniques, and rectifying flows. In addition, we demonstrate that stochastic interpolants recover the Schrödinger bridge between the two target densities when explicitly optimizing over the interpolant. Finally, algorithmic aspects are discussed and the approach is illustrated on numerical examples.
    New Insights into Graph Convolutional Networks using Neural Tangent Kernels. (arXiv:2110.04060v2 [cs.LG] UPDATED)
    Graph Convolutional Networks (GCNs) have emerged as powerful tools for learning on network structured data. Although empirically successful, GCNs exhibit certain behaviour that has no rigorous explanation -- for instance, the performance of GCNs significantly degrades with increasing network depth, whereas it improves marginally with depth using skip connections. This paper focuses on semi-supervised learning on graphs, and explains the above observations through the lens of Neural Tangent Kernels (NTKs). We derive NTKs corresponding to infinitely wide GCNs (with and without skip connections). Subsequently, we use the derived NTKs to identify that, with suitable normalisation, network depth does not always drastically reduce the performance of GCNs -- a fact that we also validate through extensive simulation. Furthermore, we propose NTK as an efficient 'surrogate model' for GCNs that does not suffer from performance fluctuations due to hyper-parameter tuning since it is a hyper-parameter free deterministic kernel. The efficacy of this idea is demonstrated through a comparison of different skip connections for GCNs using the surrogate NTKs.
    The Exact Sample Complexity Gain from Invariances for Kernel Regression. (arXiv:2303.14269v2 [cs.LG] UPDATED)
    In practice, encoding invariances into models improves sample complexity. In this work, we study this phenomenon from a theoretical perspective. In particular, we provide minimax optimal rates for kernel ridge regression on compact manifolds, with a target function that is invariant to a group action on the manifold. Our results hold for any smooth compact Lie group action, even groups of positive dimension. For a finite group, the gain effectively multiplies the number of samples by the group size. For groups of positive dimension, the gain is observed by a reduction in the manifold's dimension, in addition to a factor proportional to the volume of the quotient space. Our proof takes the viewpoint of differential geometry, in contrast to the more common strategy of using invariant polynomials. This new geometric viewpoint on learning with invariances may be of independent interest.
    Posterior Sampling with Delayed Feedback for Reinforcement Learning with Linear Function Approximation. (arXiv:2310.18919v2 [cs.LG] UPDATED)
    Recent studies in reinforcement learning (RL) have made significant progress by leveraging function approximation to alleviate the sample complexity hurdle for better performance. Despite the success, existing provably efficient algorithms typically rely on the accessibility of immediate feedback upon taking actions. The failure to account for the impact of delay in observations can significantly degrade the performance of real-world systems due to the regret blow-up. In this work, we tackle the challenge of delayed feedback in RL with linear function approximation by employing posterior sampling, which has been shown to empirically outperform the popular UCB algorithms in a wide range of regimes. We first introduce Delayed-PSVI, an optimistic value-based algorithm that effectively explores the value function space via noise perturbation with posterior sampling. We provide the first analysis for posterior sampling algorithms with delayed feedback in RL and show our algorithm achieves $\widetilde{O}(\sqrt{d^3H^3 T} + d^2H^2 E[\tau])$ worst-case regret in the presence of unknown stochastic delays. Here $E[\tau]$ is the expected delay. To further improve its computational efficiency and to expand its applicability in high-dimensional RL problems, we incorporate a gradient-based approximate sampling scheme via Langevin dynamics for Delayed-LPSVI, which maintains the same order-optimal regret guarantee with $\widetilde{O}(dHK)$ computational cost. Empirical evaluations are performed to demonstrate the statistical and computational efficacy of our algorithms.
    Using DUCK-Net for Polyp Image Segmentation. (arXiv:2311.02239v1 [cs.CV])
    This paper presents a novel supervised convolutional neural network architecture, "DUCK-Net", capable of effectively learning and generalizing from a small number of medical images to perform accurate segmentation tasks. Our model utilizes an encoder-decoder structure with a residual downsampling mechanism and a custom convolutional block to capture and process image information at multiple resolutions in the encoder segment. We employ data augmentation techniques to enrich the training set, thus increasing our model's performance. While our architecture is versatile and applicable to various segmentation tasks, in this study, we demonstrate its capabilities specifically for polyp segmentation in colonoscopy images. We evaluate the performance of our method on several popular benchmark datasets for polyp segmentation, Kvasir-SEG, CVC-ClinicDB, CVC-ColonDB, and ETIS-LARIBPOLYPDB, showing that it achieves state-of-the-art results in terms of mean Dice coefficient, Jaccard index, Precision, Recall, and Accuracy. Our approach demonstrates strong generalization capabilities, achieving excellent performance even with limited training data. The code is publicly available on GitHub: https://github.com/RazvanDu/DUCK-Net
    Understanding Curriculum Learning in Policy Optimization for Online Combinatorial Optimization. (arXiv:2202.05423v3 [cs.LG] UPDATED)
    In recent years, reinforcement learning (RL) has started to show promising results in tackling combinatorial optimization (CO) problems, in particular when coupled with curriculum learning to facilitate training. Despite emerging empirical evidence, the theoretical study of why RL helps is still at an early stage. This paper presents the first systematic study on policy optimization methods for online CO problems. We show that online CO problems can be naturally formulated as latent Markov Decision Processes (LMDPs), and prove convergence bounds on natural policy gradient (NPG) for solving LMDPs. Furthermore, our theory explains the benefit of curriculum learning: it can find a strong sampling policy and reduce the distribution shift, a critical quantity that governs the convergence rate in our theorem. For a canonical online CO problem, the Best Choice Problem (BCP), we formally prove that distribution shift is reduced exponentially with curriculum learning even if the curriculum is a randomly generated BCP on a smaller scale. Our theory also shows we can simplify the curriculum learning scheme used in prior work from multi-step to single-step. Lastly, we provide extensive experiments on the Best Choice Problem, Online Knapsack, and AdWords to verify our findings.
    Robust Fine-Tuning of Vision-Language Models for Domain Generalization. (arXiv:2311.02236v1 [cs.CV])
    Transfer learning enables the sharing of common knowledge among models for a variety of downstream tasks, but traditional methods suffer in limited training data settings and produce narrow models incapable of effectively generalizing under distribution shifts. Foundation models have recently demonstrated impressive zero-shot inference capabilities and robustness under distribution shifts. However, zero-shot evaluation for these models has been predominantly confined to benchmarks with simple distribution shifts, limiting our understanding of their effectiveness under the more realistic shifts found in practice. Moreover, common fine-tuning methods for these models have yet to be evaluated against vision models in few-shot scenarios where training data is limited. To address these gaps, we present a new recipe for few-shot fine-tuning of the popular vision-language foundation model CLIP and evaluate its performance on challenging benchmark datasets with realistic distribution shifts from the WILDS collection. Our experimentation demonstrates that, while zero-shot CLIP fails to match performance of trained vision models on more complex benchmarks, few-shot CLIP fine-tuning outperforms its vision-only counterparts in terms of in-distribution and out-of-distribution accuracy at all levels of training data availability. This provides a strong incentive for adoption of foundation models within few-shot learning applications operating with real-world data. Code is available at https://github.com/mit-ll/robust-vision-language-finetuning
    Predicting Ground Reaction Force from Inertial Sensors. (arXiv:2311.02287v1 [cs.LG])
    The study of ground reaction forces (GRF) is used to characterize the mechanical loading experienced by individuals in movements such as running, which is clinically applicable to identify athletes at risk for stress-related injuries. Our aim in this paper is to determine if data collected with inertial measurement units (IMUs), which can be worn by athletes during outdoor runs, can be used to predict GRF with sufficient accuracy to allow the analysis of its derived biomechanical variables (e.g., contact time and loading rate). In this paper, we consider lightweight approaches in contrast to state-of-the-art prediction using LSTM neural networks. Specifically, we compare the use of LSTMs to k-Nearest Neighbors (KNN) regression and propose a novel solution, SVD Embedding Regression (SER), which uses linear regression between singular value decomposition embeddings of IMU data (input) and GRF data (output). We evaluate the accuracy of these techniques when using training data collected from different athletes, from the same athlete, or both, and we explore the use of acceleration and angular velocity data from sensors at different locations (sacrum and shanks). Our results illustrate that simple machine learning methods such as SER and KNN can be similarly accurate or more accurate than LSTM neural networks, with much faster training times and hyperparameter optimization; in particular, SER and KNN are more accurate when personal training data are available, and KNN comes with the benefit of providing provenance of prediction. Notably, the use of personal data reduces prediction errors of all methods for most biomechanical variables.
    Rethinking Symmetric Matrix Factorization: A More General and Better Clustering Perspective. (arXiv:2209.02528v3 [cs.LG] UPDATED)
    Nonnegative matrix factorization (NMF) is widely used for clustering with strong interpretability. Among general NMF problems, symmetric NMF is a special one that plays an important role in graph clustering where each element measures the similarity between data points. Most existing symmetric NMF algorithms require factor matrices to be nonnegative, and focus only on minimizing the gap between the similarity matrix and its approximation for clustering, without considering other potential regularization terms that can yield better clustering. In this paper, we explore factorizing a symmetric matrix that does not have to be nonnegative, presenting an efficient factorization algorithm with a regularization term to boost the clustering performance. Moreover, a more general framework is proposed to solve symmetric matrix factorization problems with different constraints on the factor matrices.
    State-wise Safe Reinforcement Learning With Pixel Observations. (arXiv:2311.02227v1 [cs.LG])
    Reinforcement learning (RL) in the context of safe exploration has long grappled with the delicate balance between maximizing rewards and minimizing safety violations, the complexities arising from contact-rich or non-smooth environments, and high-dimensional pixel observations. Furthermore, incorporating state-wise safety constraints in the exploration and learning process, where the agent is prohibited from accessing unsafe regions without prior knowledge, adds a further layer of complexity. In this paper, we propose a novel pixel-observation safe RL algorithm that efficiently encodes state-wise safety constraints with unknown hazard regions through the introduction of a latent barrier function learning mechanism. As a joint learning framework, our approach first involves constructing a latent dynamics model with low-dimensional latent spaces derived from pixel observations. Subsequently, we build and learn a latent barrier function on top of the latent dynamics and conduct policy optimization simultaneously, thereby improving both safety and the total expected return. Experimental evaluations on the safety-gym benchmark suite demonstrate that our proposed method significantly reduces safety violations throughout the training process and achieves faster safety convergence than existing methods while remaining competitive in reward return.
    Gray Learning from Non-IID Data with Out-of-distribution Samples. (arXiv:2206.09375v2 [cs.LG] UPDATED)
    The integrity of training data, even when annotated by experts, is far from guaranteed, especially for non-IID datasets comprising both in- and out-of-distribution samples. In an ideal scenario, the majority of samples would be in-distribution, while samples that deviate semantically would be identified as out-of-distribution and excluded during the annotation process. However, experts may erroneously classify these out-of-distribution samples as in-distribution, assigning them labels that are inherently unreliable. This mixture of unreliable labels and varied data types makes the task of learning robust neural networks notably challenging. We observe that both in- and out-of-distribution samples can almost invariably be ruled out from belonging to certain classes, aside from those corresponding to unreliable ground-truth labels. This opens the possibility of utilizing reliable complementary labels that indicate the classes to which a sample does not belong. Guided by this insight, we introduce a novel approach, termed Gray Learning (GL), which leverages both ground-truth and complementary labels. Crucially, GL adaptively adjusts the loss weights for these two label types based on prediction confidence levels. By grounding our approach in statistical learning theory, we derive bounds for the generalization error, demonstrating that GL achieves tight constraints even in non-IID settings. Extensive experimental evaluations reveal that our method significantly outperforms alternative approaches grounded in robust statistics.
    Benign Overfitting for Two-layer ReLU Convolutional Neural Networks. (arXiv:2303.04145v2 [cs.LG] UPDATED)
    Modern deep learning models with great expressive power can be trained to overfit the training data but still generalize well. This phenomenon is referred to as "benign overfitting". Recently, a few studies have attempted to theoretically understand benign overfitting in neural networks. However, these works are either limited to neural networks with smooth activation functions or to the neural tangent kernel regime. How and when benign overfitting can occur in ReLU neural networks remains an open problem. In this work, we seek to answer this question by establishing algorithm-dependent risk bounds for learning two-layer ReLU convolutional neural networks with label-flipping noise. We show that, under mild conditions, the neural network trained by gradient descent can achieve near-zero training loss and Bayes optimal test risk. Our result also reveals a sharp transition between benign and harmful overfitting under different conditions on data distribution in terms of test risk. Experiments on synthetic data back up our theory.
    Tell Your Model Where to Attend: Post-hoc Attention Steering for LLMs. (arXiv:2311.02262v1 [cs.CL])
    In human-written articles, we often leverage the subtleties of text style, such as bold and italics, to guide the attention of readers. These textual emphases are vital for the readers to grasp the conveyed information. When interacting with large language models (LLMs), we have a similar need - steering the model to pay closer attention to user-specified information, e.g., an instruction. Existing methods, however, are constrained to process plain text and do not support such a mechanism. This motivates us to introduce PASTA - Post-hoc Attention STeering Approach, a method that allows LLMs to read text with user-specified emphasis marks. To this end, PASTA identifies a small subset of attention heads and applies precise attention reweighting on them, directing the model attention to user-specified parts. Like prompting, PASTA is applied at inference time and does not require changing any model parameters. Experiments demonstrate that PASTA can substantially enhance an LLM's ability to follow user instructions or integrate new knowledge from user inputs, leading to a significant performance improvement on a variety of tasks, e.g., an average accuracy improvement of 22% for LLAMA-7B. Our code is publicly available at https://github.com/QingruZhang/PASTA .
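    To make the idea of attention reweighting concrete, here is a minimal sketch of one plausible form of it: attention on positions the user did not emphasize is scaled down by a factor and the row is renormalized so it remains a probability distribution. The abstract does not spell out the exact rule, so the scaling scheme, the function name `steer_attention`, and the parameter `alpha` are assumptions for illustration, and the step of selecting which attention heads to steer is not shown.

```python
def steer_attention(row, emphasized, alpha):
    """Scale down attention on non-emphasized positions, then renormalize.

    row: attention probabilities of one query over all key positions.
    emphasized: set of key indices the user marked as important.
    alpha: factor in (0, 1] applied to every non-emphasized position.
    """
    scaled = [p if j in emphasized else alpha * p for j, p in enumerate(row)]
    total = sum(scaled)
    return [p / total for p in scaled]


# Uniform attention over four tokens; token 1 carries the user's emphasis.
row = [0.25, 0.25, 0.25, 0.25]
steered = steer_attention(row, emphasized={1}, alpha=0.5)
print(steered)
```

    In this toy case the emphasized token's share of attention rises from 0.25 to 0.4 while the row still sums to one. No model parameters are touched, which matches the abstract's point that the steering is applied at inference time.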
    Distraction is All You Need for Fairness. (arXiv:2203.07593v3 [cs.LG] UPDATED)
    Bias in training datasets must be managed for various groups in classification tasks to ensure parity or equal treatment. With the recent growth in artificial intelligence models and their expanding role in automated decision-making, ensuring that these models are not biased is vital. There is an abundance of evidence suggesting that these models could contain or even amplify the bias present in the data on which they are trained, inherent to their objective function and learning algorithms. Many researchers have addressed this issue from different directions, for example by changing the data to be statistically independent or by adversarial training that restricts the capabilities of a particular competitor who aims to maximize parity. These methods result in information loss and either fail to provide a suitable balance between accuracy and fairness or do not guarantee that biases are limited during training. To this end, we propose a powerful strategy for training deep learning models called the Distraction module, which can be theoretically proven effective in preventing bias from affecting the classification results. This method can be utilized with different data types (e.g., tabular data, images, and graphs). We demonstrate the potency of the proposed method by testing it on the UCI Adult and Heritage Health datasets (tabular), the POKEC-Z, POKEC-N and NBA datasets (graph), and the CelebA dataset (vision). Comparing against state-of-the-art methods proposed in the fairness literature for each dataset, we show that our model is superior to these methods in minimizing bias and maintaining accuracy.
    The Potential of Wearable Sensors for Assessing Patient Acuity in Intensive Care Unit (ICU). (arXiv:2311.02251v1 [cs.LG])
    Acuity assessments are vital in critical care settings to provide timely interventions and fair resource allocation. Traditional acuity scores rely on manual assessments and documentation of physiological states, which can be time-consuming, intermittent, and difficult to use for healthcare providers. Furthermore, such scores do not incorporate granular information such as patients' mobility level, which can indicate recovery or deterioration in the ICU. We hypothesized that existing acuity scores could be improved by employing Artificial Intelligence (AI) techniques in conjunction with Electronic Health Records (EHR) and wearable sensor data. In this study, we evaluated the impact of integrating mobility data collected from wrist-worn accelerometers with clinical data obtained from EHR for developing an AI-driven acuity assessment score. Accelerometry data were collected from 86 patients wearing accelerometers on their wrists in an academic hospital setting. The data were analyzed using five deep neural network models: VGG, ResNet, MobileNet, SqueezeNet, and a custom Transformer network. These models outperformed a rule-based clinical score (SOFA, the Sequential Organ Failure Assessment) used as a baseline, particularly regarding precision, sensitivity, and F1 score. The results showed that while a model relying solely on accelerometer data achieved limited performance (AUC 0.50, Precision 0.61, and F1-score 0.68), including demographic information with the accelerometer data led to a notable enhancement in performance (AUC 0.69, Precision 0.75, and F1-score 0.67). This work shows that the combination of mobility and patient information can successfully differentiate between stable and unstable states in critically ill patients.
    A New Bandit Setting Balancing Information from State Evolution and Corrupted Context. (arXiv:2011.07989v4 [cs.LG] UPDATED)
    We propose a new sequential decision-making setting, combining key aspects of two established online learning problems with bandit feedback. The optimal action to play at any given moment is contingent on an underlying changing state which is not directly observable by the agent. Each state is associated with a context distribution, possibly corrupted, allowing the agent to identify the state. Furthermore, states evolve in a Markovian fashion, providing useful information to estimate the current state via state history. In the proposed problem setting, we tackle the challenge of deciding on which of the two sources of information the agent should base its arm selection. We present an algorithm that uses a referee to dynamically combine the policies of a contextual bandit and a multi-armed bandit. We capture the time-correlation of states through iteratively learning the action-reward transition model, allowing for efficient exploration of actions. Our setting is motivated by adaptive mobile health (mHealth) interventions. Users transition through different, time-correlated, but only partially observable internal states, determining their current needs. The side information associated with each internal state might not always be reliable, so standard approaches that rely solely on the context risk incurring high regret. Similarly, some users might exhibit weaker correlations between subsequent states, so approaches that rely solely on state transitions run the same risk. We analyze our setting and algorithm in terms of regret lower and upper bounds and evaluate our method on simulated medication adherence intervention data and several real-world data sets, showing improved empirical performance compared to several popular algorithms.
    Contrastive Multi-Modal Representation Learning for Spark Plug Fault Diagnosis. (arXiv:2311.02282v1 [cs.LG])
    Because a single sensor often cannot provide enough information for condition monitoring of complex engineered industrial mechanisms, and because its measurements can be corrupted by misleading noise, multiple sensors are installed to improve the condition monitoring of such equipment. An efficient data fusion strategy is therefore required. In this research, we present a Denoising Multi-Modal Autoencoder with a unique training strategy based on the contrastive learning paradigm, both used for the first time in machine health monitoring. The presented approach, which leverages the merits of both supervised and unsupervised learning, not only achieves excellent performance in fusing multiple modalities (or views) of data into an enriched common representation but also takes data fusion a step further: one of the views can be omitted at inference time with very slight performance reduction, or even with none at all. The presented methodology enables multi-modal fault diagnosis systems to perform more robustly when a sensor fails, and one can also intentionally omit one of the sensors (the more expensive one) to build a more cost-effective condition monitoring system without sacrificing performance for practical purposes. The effectiveness of the presented methodology is examined on a real-world private multi-modal dataset gathered under non-laboratory conditions from a complex engineered mechanism, an inline four-stroke spark-ignition engine, with the aim of spark plug fault diagnosis. This dataset, which contains accelerometer and acoustic signals as two modalities, contains only a very slight amount of fault, so good performance on it suggests that the presented method can perform well on other equipment as well.
    One-shot Imitation Learning via Interaction Warping. (arXiv:2306.12392v2 [cs.RO] UPDATED)
    Imitation learning of robot policies from few demonstrations is crucial in open-ended applications. We propose a new method, Interaction Warping, for learning SE(3) robotic manipulation policies from a single demonstration. We infer the 3D mesh of each object in the environment using shape warping, a technique for aligning point clouds across object instances. Then, we represent manipulation actions as keypoints on objects, which can be warped with the shape of the object. We show successful one-shot imitation learning on three simulated and real-world object re-arrangement tasks. We also demonstrate the ability of our method to predict object meshes and robot grasps in the wild.
    Contrast Everything: A Hierarchical Contrastive Framework for Medical Time-Series. (arXiv:2310.14017v4 [cs.LG] UPDATED)
    Contrastive representation learning is crucial in medical time series analysis as it alleviates dependency on labor-intensive, domain-specific, and scarce expert annotations. However, existing contrastive learning methods primarily focus on a single data level, which fails to fully exploit the intricate nature of medical time series. To address this issue, we present COMET, an innovative hierarchical framework that leverages data consistencies at all inherent levels in medical time series. Our meticulously designed model systematically captures data consistency from four potential levels: observation, sample, trial, and patient levels. By developing contrastive loss at multiple levels, we can learn effective representations that preserve comprehensive data consistency, maximizing information utilization in a self-supervised manner. We conduct experiments in the challenging patient-independent setting. We compare COMET against six baselines using three diverse datasets, which include ECG signals for myocardial infarction and EEG signals for Alzheimer's and Parkinson's diseases. The results demonstrate that COMET consistently outperforms all baselines, particularly in setups with 10% and 1% labeled data fractions across all datasets. These results underscore the significant impact of our framework in advancing contrastive representation learning techniques for medical time series. The source code is available at https://github.com/DL4mHealth/COMET.
    Robust representations of oil wells' intervals via sparse attention mechanism. (arXiv:2212.14246v3 [cs.LG] UPDATED)
    Transformer-based neural network architectures achieve state-of-the-art results in different domains, from natural language processing (NLP) to computer vision (CV). The key idea of Transformers, the attention mechanism, has already led to significant breakthroughs in many areas. Attention has been applied to time series data as well. However, due to the quadratic complexity of the attention calculation with respect to input sequence length, the application of Transformers is limited by high resource demands. Moreover, their modifications for industrial time series need to be robust to missing or noisy values, which complicates their wider application. To cope with these issues, we introduce a class of efficient Transformers named Regularized Transformers (Reguformers). We implement a regularization technique inspired by dropout to improve robustness and reduce computational expenses. The focus of our experiments is on oil and gas data, namely well logs, a prominent example of multivariate time series. The goal is to solve the problems of similarity and representation learning for them. To evaluate our models on such problems, we work with an industry-scale open dataset consisting of well logs from more than 20 wells. The experiments show that all variations of Reguformers outperform the previously developed RNNs, the classical Transformer model, and robust modifications of it such as Informer and Performer in terms of well-interval classification and the quality of the obtained well-interval representations. Moreover, the robustness of our models to missing and incorrect data exceeds that of others by a significant margin. The best result that the Reguformer achieves on the well-interval similarity task is a mean PR AUC score of 0.983, which is comparable to the classical Transformer and outperforms the previous models.
    ZRG: A Dataset for Multimodal 3D Residential Rooftop Understanding. (arXiv:2304.13219v2 [cs.CV] UPDATED)
    A crucial part of any home is the roof over our heads to protect us from the elements. In this paper we present the Zeitview Rooftop Geometry (ZRG) dataset for residential rooftop understanding. ZRG is a large-scale residential rooftop dataset of over 20k properties collected through roof inspections from across the U.S. and contains multiple modalities including high resolution aerial orthomosaics, digital surface models (DSM), colored point clouds, and 3D roof wireframe annotations. We provide an in-depth analysis and perform several experimental baselines including roof outline extraction, monocular height estimation, and planar roof structure extraction, to illustrate a few of the numerous potential applications unlocked by this dataset.
    Imitation Bootstrapped Reinforcement Learning. (arXiv:2311.02198v1 [cs.LG])
    Despite the considerable potential of reinforcement learning (RL), robotics control tasks predominantly rely on imitation learning (IL) owing to its better sample efficiency. However, given the high cost of collecting extensive demonstrations, RL is still appealing if it can utilize limited imitation data for efficient autonomous self-improvement. Existing RL methods that utilize demonstrations either initialize the replay buffer with demonstrations and oversample them during RL training, which does not benefit from the generalization potential of modern IL methods, or pretrain the RL policy with IL on the demonstrations, which requires additional mechanisms to prevent catastrophic forgetting during RL fine-tuning. We propose imitation bootstrapped reinforcement learning (IBRL), a novel framework that first trains an IL policy on a limited number of demonstrations and then uses it to propose alternative actions for both online exploration and target value bootstrapping. IBRL achieves SoTA performance and sample efficiency on 7 challenging sparse reward continuous control tasks in simulation while learning directly from pixels. As a highlight of our method, IBRL achieves $6.4\times$ higher success rate than RLPD, a strong method that combines the idea of oversampling demonstrations with modern RL improvements, under the budget of 10 demos and 100K interactions in the challenging PickPlaceCan task in the Robomimic benchmark.
    Numerically Stable Sparse Gaussian Processes via Minimum Separation using Cover Trees. (arXiv:2210.07893v3 [stat.ML] UPDATED)
    Gaussian processes are frequently deployed as part of larger machine learning and decision-making systems, for instance in geospatial modeling, Bayesian optimization, or in latent Gaussian models. Within a system, the Gaussian process model needs to perform in a stable and reliable manner to ensure it interacts correctly with other parts of the system. In this work, we study the numerical stability of scalable sparse approximations based on inducing points. To do so, we first review numerical stability, and illustrate typical situations in which Gaussian process models can be unstable. Building on stability theory originally developed in the interpolation literature, we derive sufficient and in certain cases necessary conditions on the inducing points for the computations performed to be numerically stable. For low-dimensional tasks such as geospatial modeling, we propose an automated method for computing inducing points satisfying these conditions. This is done via a modification of the cover tree data structure, which is of independent interest. We additionally propose an alternative sparse approximation for regression with a Gaussian likelihood which trades off a small amount of performance to further improve stability. We provide illustrative examples showing the relationship between stability of calculations and predictive performance of inducing point methods on spatial tasks.
    A Closer Look at Reward Decomposition for High-Level Robotic Explanations. (arXiv:2304.12958v2 [cs.LG] UPDATED)
    Explaining the behaviour of intelligent agents learned by reinforcement learning (RL) to humans is challenging yet crucial due to their incomprehensible proprioceptive states, variational intermediate goals, and resultant unpredictability. Moreover, one-step explanations for RL agents can be ambiguous as they fail to account for the agent's future behaviour at each transition, adding to the complexity of explaining robot actions. By leveraging abstracted actions that map to task-specific primitives, we avoid explanations on the movement level. To further improve the transparency and explainability of robotic systems, we propose an explainable Q-Map learning framework that combines reward decomposition (RD) with abstracted action spaces, allowing for non-ambiguous and high-level explanations based on object properties in the task. We demonstrate the effectiveness of our framework through quantitative and qualitative analysis of two robotic scenarios, showcasing visual and textual explanations, from output artefacts of RD explanations, that are easy for humans to comprehend. Additionally, we demonstrate the versatility of integrating these artefacts with large language models (LLMs) for reasoning and interactive querying.
    A unified recipe for deriving (time-uniform) PAC-Bayes bounds. (arXiv:2302.03421v4 [stat.ML] UPDATED)
    We present a unified framework for deriving PAC-Bayesian generalization bounds. Unlike most previous literature on this topic, our bounds are anytime-valid (i.e., time-uniform), meaning that they hold at all stopping times, not only for a fixed sample size. Our approach combines four tools in the following order: (a) nonnegative supermartingales or reverse submartingales, (b) the method of mixtures, (c) the Donsker-Varadhan formula (or other convex duality principles), and (d) Ville's inequality. Our main result is a PAC-Bayes theorem which holds for a wide class of discrete stochastic processes. We show how this result implies time-uniform versions of well-known classical PAC-Bayes bounds, such as those of Seeger, McAllester, Maurer, and Catoni, in addition to many recent bounds. We also present several novel bounds. Our framework also enables us to relax traditional assumptions; in particular, we consider nonstationary loss functions and non-i.i.d. data. In sum, we unify the derivation of past bounds and ease the search for future bounds: one may simply check if our supermartingale or submartingale conditions are met and, if so, be guaranteed a (time-uniform) PAC-Bayes bound.
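    For reference, the two standard tools named in steps (c) and (d) can be stated as below. These are textbook-style statements, not the paper's theorem; the symbols $M_t$, $\delta$, $\pi$, $\rho$, and $f$ are notation chosen here for illustration.

```latex
% Ville's inequality: for a nonnegative supermartingale (M_t)_{t \ge 0}
% with E[M_0] \le 1 and any \delta \in (0, 1),
\mathbb{P}\left( \exists\, t \ge 0 : M_t \ge 1/\delta \right) \le \delta .

% Donsker-Varadhan variational formula: for a prior \pi and a measurable f
% with E_{\theta \sim \pi}[e^{f(\theta)}] < \infty,
\log \mathbb{E}_{\theta \sim \pi}\left[ e^{f(\theta)} \right]
  = \sup_{\rho \ll \pi} \left\{ \mathbb{E}_{\theta \sim \rho}[f(\theta)]
      - \mathrm{KL}(\rho \,\|\, \pi) \right\} .
```

    Under the assumption that $M_t = \mathbb{E}_{\theta \sim \pi}[e^{f_t(\theta)}]$ is such a supermartingale, combining the two gives, with probability at least $1 - \delta$, $\mathbb{E}_{\theta \sim \rho}[f_t(\theta)] \le \mathrm{KL}(\rho \,\|\, \pi) + \log(1/\delta)$ simultaneously for all $t$ and all $\rho$, which is the general shape of a time-uniform PAC-Bayes bound.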
    Improving Code Example Recommendations on Informal Documentation Using BERT and Query-Aware LSH: A Comparative Study. (arXiv:2305.03017v4 [cs.SE] UPDATED)
    Our research investigates the recommendation of code examples to aid software developers, a practice that saves developers significant time by providing ready-to-use code snippets. The focus of our study is Stack Overflow, a commonly used resource for coding discussions and solutions, particularly in the context of the Java programming language. We applied BERT, a powerful Large Language Model (LLM) that enables us to transform code examples into numerical vectors by extracting their semantic information. Once these numerical representations are prepared, we identify Approximate Nearest Neighbors (ANN) using Locality-Sensitive Hashing (LSH). Our research employed two variants of LSH: Random Hyperplane-based LSH and Query-Aware LSH. We rigorously compared these two approaches across four parameters: HitRate, Mean Reciprocal Rank (MRR), Average Execution Time, and Relevance. Our study revealed that the Query-Aware (QA) approach showed superior performance over the Random Hyperplane-based (RH) method. Specifically, it exhibited a notable improvement of 20% to 35% in HitRate for query pairs compared to the RH approach. Furthermore, the QA approach proved significantly more time-efficient, creating hashing tables and assigning data samples to buckets at least four times faster. It can return code examples within milliseconds, whereas the RH approach typically requires several seconds to recommend code examples. Due to the superior performance of the QA approach, we tested it against PostFinder and FaCoY, the state-of-the-art baselines. Our QA method showed comparable efficiency, demonstrating its potential for effective code recommendation.
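    As a rough illustration of the Random Hyperplane-based variant mentioned above, the sketch below hashes vectors by the sign of their projections onto random hyperplanes and looks up candidates in the matching bucket. It is a generic textbook version, not the authors' implementation; the function names, the bit count, and the toy vectors are assumptions made for this example, and real embeddings would come from a model such as BERT.

```python
import random


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random hyperplane normals with Gaussian components."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def signature(vec, hyperplanes):
    """One bit per hyperplane: 1 if the vector lies on its non-negative side."""
    return tuple(
        1 if sum(h_i * v_i for h_i, v_i in zip(h, vec)) >= 0 else 0
        for h in hyperplanes
    )


def build_index(vectors, hyperplanes):
    """Group vector indices into buckets keyed by their bit signature."""
    buckets = {}
    for idx, vec in enumerate(vectors):
        buckets.setdefault(signature(vec, hyperplanes), []).append(idx)
    return buckets


def query(vec, buckets, hyperplanes):
    """Return the indices stored in the bucket the query vector hashes to."""
    return buckets.get(signature(vec, hyperplanes), [])


# Toy usage with hypothetical 4-dimensional "embeddings".
planes = make_hyperplanes(dim=4, n_bits=8, seed=1)
vectors = [[1.0, 0.5, -0.2, 0.3], [-1.0, 0.2, 0.9, -0.4]]
buckets = build_index(vectors, planes)
candidates = query([2.0, 1.0, -0.4, 0.6], buckets, planes)
```

    Because the signature depends only on the sign of each projection, vectors pointing in the same direction always share a bucket, which is why this scheme approximates cosine-similarity search.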
    AtMan: Understanding Transformer Predictions Through Memory Efficient Attention Manipulation. (arXiv:2301.08110v5 [cs.LG] UPDATED)
    Generative transformer models have become increasingly complex, with large numbers of parameters and the ability to process multiple input modalities. Current methods for explaining their predictions are resource-intensive. Most crucially, they require prohibitively large amounts of extra memory, since they rely on backpropagation, which allocates almost twice as much GPU memory as the forward pass. This makes it difficult, if not impossible, to use them in production. We present AtMan, which provides explanations of generative transformer models at almost no extra cost. Specifically, AtMan is a modality-agnostic perturbation method that manipulates the attention mechanisms of transformers to produce relevance maps for the input with respect to the output prediction. Instead of using backpropagation, AtMan applies a parallelizable token-based search method based on cosine similarity neighborhood in the embedding space. Our exhaustive experiments on text and image-text benchmarks demonstrate that AtMan outperforms current state-of-the-art gradient-based methods on several metrics while being computationally efficient. As such, AtMan is suitable for use in large model inference deployments.
    An adaptive safety layer with hard constraints for safe reinforcement learning in multi-energy management systems. (arXiv:2304.08897v3 [eess.SY] UPDATED)
    Safe reinforcement learning (RL) with hard constraint guarantees is a promising optimal control direction for multi-energy management systems. It only requires the environment-specific constraint functions themselves a priori and not a complete model. The project-specific upfront and ongoing engineering efforts are therefore still reduced, better representations of the underlying system dynamics can still be learnt, and modelling bias is kept to a minimum. However, even the constraint functions alone are not always trivial to provide accurately in advance, leading to potentially unsafe behaviour. In this paper, we present two novel advancements: (I) combining the OptLayer and SafeFallback methods, named OptLayerPolicy, to increase the initial utility while keeping a high sample efficiency and the possibility to formulate equality constraints; (II) introducing self-improving hard constraints, to increase the accuracy of the constraint functions as more and new data become available so that better policies can be learnt. Both advancements keep the constraint formulation decoupled from the RL formulation, so new (presumably better) RL algorithms can act as drop-in replacements. We have shown that, in a simulated multi-energy system case study, the initial utility is increased to 92.4% (OptLayerPolicy) compared to 86.1% (OptLayer) and that the policy after training is increased to 104.9% (GreyOptLayerPolicy) compared to 103.4% (OptLayer) - all relative to a vanilla RL benchmark. Although introducing surrogate functions into the optimisation problem requires special attention, we conclude that the newly presented GreyOptLayerPolicy method is the most advantageous.
    Gaussian Process Probes (GPP) for Uncertainty-Aware Probing. (arXiv:2305.18213v2 [cs.LG] UPDATED)
    Understanding which concepts models can and cannot represent has been fundamental to many tasks: from effective and responsible use of models to detecting out of distribution data. We introduce Gaussian process probes (GPP), a unified and simple framework for probing and measuring uncertainty about concepts represented by models. As a Bayesian extension of linear probing methods, GPP asks what kind of distribution over classifiers (of concepts) is induced by the model. This distribution can be used to measure both what the model represents and how confident the probe is about what the model represents. GPP can be applied to any pre-trained model with vector representations of inputs (e.g., activations). It does not require access to training data, gradients, or the architecture. We validate GPP on datasets containing both synthetic and real images. Our experiments show it can (1) probe a model's representations of concepts even with a very small number of examples, (2) accurately measure both epistemic uncertainty (how confident the probe is) and aleatory uncertainty (how fuzzy the concepts are to the model), and (3) detect out of distribution data using those uncertainty measures as well as classic methods do. By using Gaussian processes to expand what probing can offer, GPP provides a data-efficient, versatile and uncertainty-aware tool for understanding and evaluating the capabilities of machine learning models.
    Hierarchical Reinforcement Learning for Power Network Topology Control. (arXiv:2311.02129v1 [cs.LG])
    Learning in high-dimensional action spaces is a key challenge in applying reinforcement learning (RL) to real-world systems. In this paper, we study the possibility of controlling power networks using RL methods. Power networks are critical infrastructures that are complex to control. In particular, the combinatorial nature of the action space poses a challenge to both conventional optimizers and learned controllers. Hierarchical reinforcement learning (HRL) represents one approach to address this challenge. More precisely, an HRL framework for power network topology control is proposed. The HRL framework consists of three levels of action abstraction. At the highest level, there is the overall long-term task of power network operation, namely, keeping the power grid state within security constraints at all times, which is decomposed into two temporally extended actions: 'do nothing' versus 'propose a topology change'. At the intermediate level, the action space consists of all controllable substations. Finally, at the lowest level, the action space consists of all configurations of the chosen substation. By employing this HRL framework, several hierarchical power network agents are trained for the IEEE 14-bus network. Whereas at the highest level a purely rule-based policy is still chosen for all agents in this study, at the intermediate level the policy is trained using different state-of-the-art RL algorithms. At the lowest level, either an RL algorithm or a greedy algorithm is used. The performance of the different 3-level agents is compared with standard baseline (RL or greedy) approaches. A key finding is that the 3-level agent that employs RL both at the intermediate and the lowest level outperforms all other agents on the most difficult task. Our code is publicly available.
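    The three levels of action abstraction described above can be sketched as a single decision function. This is only a schematic under stated assumptions: the threshold rule at the top level, the score dictionaries standing in for learned or greedy policies, and all names are hypothetical and are not taken from the paper's code.

```python
def three_level_action(max_line_loading, substation_scores, config_scores,
                       threshold=0.9):
    """Pick a topology action via three nested decisions.

    max_line_loading: hypothetical summary of the grid state (1.0 = at limit).
    substation_scores: dict mapping substation name -> preference score.
    config_scores: dict mapping substation name -> {configuration -> score}.
    Returns None for 'do nothing', else a (substation, configuration) pair.
    """
    # Level 1 (rule-based; this particular rule is an assumption):
    # do nothing while the grid is comfortably within its limits.
    if max_line_loading < threshold:
        return None

    # Level 2: choose which controllable substation to act on.
    substation = max(substation_scores, key=substation_scores.get)

    # Level 3: choose a configuration of the chosen substation.
    configs = config_scores[substation]
    configuration = max(configs, key=configs.get)
    return substation, configuration


# Toy usage with made-up scores.
sub_scores = {"sub_1": 0.2, "sub_2": 0.7}
cfg_scores = {"sub_1": {"a": 1.0}, "sub_2": {"a": 0.1, "b": 0.4}}
safe_action = three_level_action(0.5, sub_scores, cfg_scores)
stressed_action = three_level_action(0.95, sub_scores, cfg_scores)
```

    Splitting the choice this way means the lowest level only ever ranks the configurations of one substation, rather than every configuration of every substation at once.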
    Exploring the Numerical Reasoning Capabilities of Language Models: A Comprehensive Analysis on Tabular Data. (arXiv:2311.02216v1 [cs.CL])
    Numbers are crucial for various real-world domains such as finance, economics, and science. Thus, understanding and reasoning with numbers are essential skills for language models to solve different tasks. While different numerical benchmarks have been introduced in recent years, they are mostly limited to specific numerical aspects. In this paper, we propose a hierarchical taxonomy for numerical reasoning skills with more than ten reasoning types across four levels: representation, number sense, manipulation, and complex reasoning. We conduct a comprehensive evaluation of state-of-the-art models to identify reasoning challenges specific to them. We then develop a diverse set of numerical probes employing a semi-automated approach. We focus on the tabular Natural Language Inference (TNLI) task as a case study and measure models' performance shifts. Our results show that no model consistently excels across all numerical reasoning types. Among the probed models, FlanT5 (few-/zero-shot) and GPT-3.5 (few-shot) demonstrate strong overall numerical reasoning skills compared to other models. Label-flipping probes indicate that models often exploit dataset artifacts to predict the correct labels.
    Automating Governing Knowledge Commons and Contextual Integrity (GKC-CI) Privacy Policy Annotations with Large Language Models. (arXiv:2311.02192v1 [cs.CY])
    Identifying contextual integrity (CI) and governing knowledge commons (GKC) parameters in privacy policy texts can facilitate normative privacy analysis. However, GKC-CI annotation has heretofore required manual or crowdsourced effort. This paper demonstrates that high-accuracy GKC-CI parameter annotation of privacy policies can be performed automatically using large language models. We fine-tune 18 open-source and proprietary models on 21,588 GKC-CI annotations from 16 ground truth privacy policies. Our best-performing model (fine-tuned GPT-3.5 Turbo with prompt engineering) has an accuracy of 86%, exceeding the performance of prior crowdsourcing approaches despite the complexity of privacy policy texts and the nuance of the GKC-CI annotation task. We apply our best-performing model to privacy policies from 164 popular online services, demonstrating the effectiveness of scaling GKC-CI annotation for data exploration. We make all annotated policies as well as the training data and scripts needed to fine-tune our best-performing model publicly available for future research.
    Structured Neural Networks for Density Estimation and Causal Inference. (arXiv:2311.02221v1 [cs.LG])
    Injecting structure into neural networks enables learning functions that satisfy invariances with respect to subsets of inputs. For instance, when learning generative models using neural networks, it is advantageous to encode the conditional independence structure of observed variables, often in the form of Bayesian networks. We propose the Structured Neural Network (StrNN), which injects structure through masking pathways in a neural network. The masks are designed via a novel relationship we explore between neural network architectures and binary matrix factorization, to ensure that the desired independencies are respected. We devise and study practical algorithms for this otherwise NP-hard design problem based on novel objectives that control the model architecture. We demonstrate the utility of StrNN in three applications: (1) binary and Gaussian density estimation with StrNN, (2) real-valued density estimation with Structured Autoregressive Flows (StrAFs) and Structured Continuous Normalizing Flows (StrCNF), and (3) interventional and counterfactual analysis with StrAFs for causal inference. Our work opens up new avenues for learning neural networks that enable data-efficient generative modeling and the use of normalizing flows for causal effect estimation.
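    To make the masking idea concrete, the sketch below builds a two-layer masked network in which output j can depend only on the inputs marked in row j of a binary adjacency matrix. The factorization used here (one hidden unit per output, identity second mask) is the most trivial choice and is an assumption for illustration; it is not the mask-design algorithm studied in the paper, and all names are hypothetical.

```python
import math
import random


def trivial_masks(adjacency):
    """Factor a d x d binary adjacency matrix A as M2 @ M1 with M1 = A, M2 = I."""
    d = len(adjacency)
    m1 = [row[:] for row in adjacency]  # hidden unit j sees only inputs allowed for output j
    m2 = [[1 if i == j else 0 for j in range(d)] for i in range(d)]
    return m1, m2


def masked_forward(x, w1, w2, m1, m2):
    """Two-layer network whose weights are elementwise-multiplied by the masks."""
    hidden = [
        math.tanh(sum(w * m * xi for w, m, xi in zip(w_row, m_row, x)))
        for w_row, m_row in zip(w1, m1)
    ]
    return [
        sum(w * m * h for w, m, h in zip(w_row, m_row, hidden))
        for w_row, m_row in zip(w2, m2)
    ]


# Toy autoregressive structure: output j may depend only on inputs with index < j.
adjacency = [[0, 0, 0], [1, 0, 0], [1, 1, 0]]
m1, m2 = trivial_masks(adjacency)
rng = random.Random(0)
w1 = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(3)]
w2 = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(3)]

base = masked_forward([0.3, -0.7, 1.2], w1, w2, m1, m2)
changed_last = masked_forward([0.3, -0.7, 9.9], w1, w2, m1, m2)
changed_mid = masked_forward([0.3, 5.0, 1.2], w1, w2, m1, m2)
```

    Changing an input that is masked out for a given output leaves that output exactly unchanged, which is the independence property the masks are meant to enforce.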
    Explainable Authorship Identification in Cultural Heritage Applications: Analysis of a New Perspective. (arXiv:2311.02237v1 [cs.LG])
    While a substantial amount of work has recently been devoted to enhancing the performance of computational Authorship Identification (AId) systems, little to no attention has been paid to endowing AId systems with the ability to explain the reasons behind their predictions. This lack substantially hinders the practical employment of AId methodologies, since the predictions returned by such systems are hardly useful unless they are supported with suitable explanations. In this paper, we explore the applicability of existing general-purpose eXplainable Artificial Intelligence (XAI) techniques to AId, with a special focus on explanations addressed to scholars working in cultural heritage. In particular, we assess the relative merits of three different types of XAI techniques (feature ranking, probing, factuals and counterfactual selection) on three different AId tasks (authorship attribution, authorship verification, same-authorship verification) by running experiments on real AId data. Our analysis shows that, while these techniques make important first steps towards explainable Authorship Identification, more work remains to be done in order to provide tools that can be profitably integrated in the workflows of scholars.
    Joint Composite Latent Space Bayesian Optimization. (arXiv:2311.02213v1 [cs.LG])
    Bayesian Optimization (BO) is a technique for sample-efficient black-box optimization that employs probabilistic models to identify promising input locations for evaluation. When dealing with composite-structured functions, such as $f = g \circ h$, evaluating a specific location $x$ yields observations of both the final outcome $f(x) = g(h(x))$ as well as the intermediate output(s) $h(x)$. Previous research has shown that integrating information from these intermediate outputs can enhance BO performance substantially. However, existing methods struggle if the outputs $h(x)$ are high-dimensional. Many relevant problems fall into this setting, including in the context of generative AI, molecular design, or robotics. To effectively tackle these challenges, we introduce Joint Composite Latent Space Bayesian Optimization (JoCo), a novel framework that jointly trains neural network encoders and probabilistic models to adaptively compress high-dimensional input and output spaces into manageable latent representations. This enables viable BO on these compressed representations, allowing JoCo to outperform other state-of-the-art methods in high-dimensional BO on a wide variety of simulated and real-world problems.
    Emergence of Abstract State Representations in Embodied Sequence Modeling. (arXiv:2311.02171v1 [cs.LG])
    Decision making via sequence modeling aims to mimic the success of language models, where actions taken by an embodied agent are modeled as tokens to predict. Despite their promising performance, it remains unclear if embodied sequence modeling leads to the emergence of internal representations that represent the environmental state information. A model that lacks abstract state representations would be liable to make decisions based on surface statistics which fail to generalize. We take the BabyAI environment, a grid world in which language-conditioned navigation tasks are performed, and build a sequence modeling Transformer, which takes a language instruction, a sequence of actions, and environmental observations as its inputs. In order to investigate the emergence of abstract state representations, we design a "blindfolded" navigation task, where only the initial environmental layout, the language instruction, and the action sequence to complete the task are available for training. Our probing results show that intermediate environmental layouts can be reasonably reconstructed from the internal activations of a trained model, and that language instructions play a role in the reconstruction accuracy. Our results suggest that many key features of state representations can emerge via embodied sequence modeling, supporting an optimistic outlook for applications of sequence modeling objectives to more complex embodied decision-making domains.
    AlberDICE: Addressing Out-Of-Distribution Joint Actions in Offline Multi-Agent RL via Alternating Stationary Distribution Correction Estimation. (arXiv:2311.02194v1 [cs.LG])
    One of the main challenges in offline Reinforcement Learning (RL) is the distribution shift that arises from the learned policy deviating from the data collection policy. This is often addressed by avoiding out-of-distribution (OOD) actions during policy improvement as their presence can lead to substantial performance degradation. This challenge is amplified in the offline Multi-Agent RL (MARL) setting since the joint action space grows exponentially with the number of agents. To avoid this curse of dimensionality, existing MARL methods adopt either value decomposition methods or fully decentralized training of individual agents. However, even when combined with standard conservatism principles, these methods can still result in the selection of OOD joint actions in offline MARL. To this end, we introduce AlberDICE, an offline MARL algorithm that alternately performs centralized training of individual agents based on stationary distribution optimization. AlberDICE circumvents the exponential complexity of MARL by computing the best response of one agent at a time while effectively avoiding OOD joint action selection. Theoretically, we show that the alternating optimization procedure converges to Nash policies. In the experiments, we demonstrate that AlberDICE significantly outperforms baseline algorithms on a standard suite of MARL benchmarks.
    Learning Time-Invariant Representations for Individual Neurons from Population Dynamics. (arXiv:2311.02258v1 [q-bio.NC])
    Neurons can display highly variable dynamics. While such variability presumably supports the wide range of behaviors generated by the organism, their gene expressions are relatively stable in the adult brain. This suggests that neuronal activity is a combination of its time-invariant identity and the inputs the neuron receives from the rest of the circuit. Here, we propose a self-supervised learning-based method to assign time-invariant representations to individual neurons based on a permutation- and population-size-invariant summary of population recordings. We fit dynamical models to neuronal activity to learn a representation by considering the activity of both the individual and the neighboring population. Our self-supervised approach and use of implicit representations enable robust inference against imperfections such as partial overlap of neurons across sessions, trial-to-trial variability, and limited availability of molecular (transcriptomic) labels for downstream supervised tasks. We demonstrate our method on a public multimodal dataset of mouse cortical neuronal activity and transcriptomic labels. We report > 35% improvement in predicting the transcriptomic subclass identity and > 20% improvement in predicting class identity with respect to the state-of-the-art.
    Heteroskedastic Tensor Clustering. (arXiv:2311.02306v1 [math.ST])
    Tensor clustering, which seeks to extract underlying cluster structures from noisy tensor observations, has gained increasing attention. One extensively studied model for tensor clustering is the tensor block model, which postulates the existence of clustering structures along each mode and has found broad applications in areas like multi-tissue gene expression analysis and multilayer network analysis. However, currently available computationally feasible methods for tensor clustering either are limited to handling i.i.d. sub-Gaussian noise or suffer from suboptimal statistical performance, which restrains their utility in applications that have to deal with heteroskedastic data and/or low signal-to-noise-ratio (SNR). To overcome these challenges, we propose a two-stage method, named $\mathsf{High\text{-}order~HeteroClustering}$ ($\mathsf{HHC}$), which starts by performing tensor subspace estimation via a novel spectral algorithm called $\mathsf{Thresholded~Deflated\text{-}HeteroPCA}$, followed by approximate $k$-means to obtain cluster nodes. Encouragingly, our algorithm provably achieves exact clustering as long as the SNR exceeds the computational limit (ignoring logarithmic factors); here, the SNR refers to the ratio of the pairwise disparity between nodes to the noise level, and the computational limit indicates the lowest SNR that enables exact clustering with polynomial runtime. Comprehensive simulation and real-data experiments suggest that our algorithm outperforms existing algorithms across various settings, delivering more reliable clustering performance.
    Feature Attribution Explanations for Spiking Neural Networks. (arXiv:2311.02110v1 [cs.NE])
    Third-generation artificial neural networks, Spiking Neural Networks (SNNs), can be efficiently implemented on hardware. Their implementation on neuromorphic chips opens a broad range of applications, such as machine learning-based autonomous control and intelligent biomedical devices. In critical applications, however, insight into the reasoning of SNNs is important, thus SNNs need to be equipped with the ability to explain how decisions are reached. We present Temporal Spike Attribution (TSA), a local explanation method for SNNs. To compute the explanation, we aggregate all information available in model-internal variables: spike times and model weights. We evaluate TSA on artificial and real-world time series data and measure explanation quality w.r.t. multiple quantitative criteria. We find that TSA correctly identifies a small subset of input features relevant to the decision (i.e., is output-complete and compact) and generates similar explanations for similar inputs (i.e., is continuous). Further, our experiments show that incorporating the notion of absent spikes improves explanation quality. Our work can serve as a starting point for explainable SNNs, with future implementations on hardware yielding not only predictions but also explanations in a broad range of application scenarios. Source code is available at https://github.com/ElisaNguyen/tsa-explanations.
    A Systematic Review of Deep Graph Neural Networks: Challenges, Classification, Architectures, Applications & Potential Utility in Bioinformatics. (arXiv:2311.02127v1 [cs.LG])
    In recent years, machine learning tasks ranging from image processing & audio/video analysis to natural language understanding have been transformed by deep learning. The data content in all these scenarios is expressed in Euclidean space. However, a considerable amount of application data is structured in non-Euclidean space and is expressed as graphs, e.g. when dealing with complicated interactions & object interdependencies. Modelling physical systems, learning molecular signatures, identifying protein interactions and predicting diseases involve utilising a model that can learn from graph data. Graph neural networks (GNNs), specified as artificial-neural models, employ message transmission between graph nodes to represent graph dependencies and are primarily used in the non-Euclidean domain. Variants of GNN like Graph Recurrent Networks (GRN), Graph Auto Encoder (GAE), Graph Convolution Networks (GCN), Graph Adversarial Methods & Graph Reinforcement learning have exhibited breakthrough productivity on a wide range of tasks, especially in the field of bioinformatics, in recent years as a result of the rapid collection of biological network data. Apart from presenting all existing GNN models, mathematical analysis and comparison of the variants of all types of GNN have been highlighted in this survey. Graph neural networks are investigated for their potential real-world applications in various fields, focusing on Bioinformatics. Furthermore, resources for evaluating graph neural network models and accessing open-source code & benchmark data sets are included. Finally, we provide seven proposals for future research in this rapidly evolving domain. GNNs have the potential to be an excellent tool for solving a wide range of biological challenges in bioinformatics research, as they are best represented as connected complex graphs.
    A Comprehensive Study on Model Initialization Techniques Ensuring Efficient Federated Learning. (arXiv:2311.02100v1 [cs.LG])
    Advancement in the field of machine learning is unavoidable, but a major concern is preserving the privacy of the users whose data is being used for training these machine learning algorithms. Federated learning (FL) has emerged as a promising paradigm for training machine learning models in a distributed and privacy-preserving manner which enables one to collaborate and train a global model without sharing local data. But starting this learning process on each device in the right way, called "model initialization", is critical. The choice of initialization methods used for models plays a crucial role in the performance, convergence speed, communication efficiency, privacy guarantees of federated learning systems, etc. In this survey, we dive deeper into a comprehensive study of various ways of model initialization techniques in FL. Unlike other studies, our research meticulously compares, categorizes, and delineates the merits and demerits of each technique, examining their applicability across diverse FL scenarios. We highlight how factors like client variability, data non-IIDness, model caliber, security considerations, and network restrictions influence FL model outcomes and propose how strategic initialization can address and potentially rectify many such challenges. The motivation behind this survey is to highlight that the right start can help overcome challenges like varying data quality, security issues, and network problems. Our insights provide a foundational base for experts looking to fully utilize FL, while also understanding the complexities of model initialization.
    Efficient Machine Learning Ensemble Methods for Detecting Gravitational Wave Glitches in LIGO Time Series. (arXiv:2311.02106v1 [cs.LG])
    Gravitational Wave (GW) analysis has grown in popularity as technology has advanced and the process of observing gravitational waves has become more precise. Although the sensitivity and the frequency of observation of GW signals are constantly improving, the possibility of noise in the collected GW data remains. In this paper, we propose two new Machine and Deep learning ensemble approaches (i.e., ShallowWaves and DeepWaves Ensembles) for detecting different types of noise and patterns in datasets from GW observatories. Our research also investigates various Machine and Deep Learning techniques for multi-class classification and provides a comprehensive benchmark, emphasizing the best results in terms of three commonly used performance metrics (i.e., accuracy, precision, and recall). We train and test our models on a dataset consisting of annotated time series from real-world data collected by the Advanced Laser Interferometer GW Observatory (LIGO). We empirically show that the best overall accuracy is obtained by the proposed DeepWaves Ensemble, followed closely by the ShallowWaves Ensemble.
    Pairing-based graph neural network for simulating quantum materials. (arXiv:2311.02143v1 [cond-mat.str-el])
    We introduce a pairing-based graph neural network, $\textit{GemiNet}$, for simulating quantum many-body systems. Our architecture augments a BCS mean-field wavefunction with a generalized pair amplitude parameterized by a graph neural network. Variational Monte Carlo with GemiNet simultaneously provides an accurate, flexible, and scalable method for simulating many-electron systems. We apply GemiNet to two-dimensional semiconductor electron-hole bilayers and obtain highly accurate results on a variety of interaction-induced phases, including the exciton Bose-Einstein condensate, electron-hole superconductor, and bilayer Wigner crystal. Our study demonstrates the potential of physically-motivated neural network wavefunctions for quantum materials simulations.
    Combining Deep Learning on Order Books with Reinforcement Learning for Profitable Trading. (arXiv:2311.02088v1 [q-fin.CP])
    High-frequency trading is prevalent, where automated decisions must be made quickly to take advantage of price imbalances and patterns in price action that forecast near-future movements. While many algorithms have been explored and tested, analytical methods fail to harness the whole nature of the market environment by focusing on a limited domain. With the evergrowing machine learning field, many large-scale end-to-end studies on raw data have been successfully employed to increase the domain scope for profitable trading but are very difficult to replicate. Combining deep learning on the order books with reinforcement learning is one way of breaking down large-scale end-to-end learning into more manageable and lightweight components for reproducibility, suitable for retail trading. The following work focuses on forecasting returns across multiple horizons using order flow imbalance and training three temporal-difference learning models for five financial instruments to provide trading signals. The instruments used are two foreign exchange pairs (GBPUSD and EURUSD), two indices (DE40 and FTSE100), and one commodity (XAUUSD). The performances of these 15 agents are evaluated through backtesting simulation, and successful models proceed through to forward testing on a retail trading platform. The results prove potential but require further minimal modifications for consistently profitable trading to fully handle retail trading costs, slippage, and spread fluctuation.
    Client Orchestration and Cost-Efficient Joint Optimization for NOMA-Enabled Hierarchical Federated Learning. (arXiv:2311.02130v1 [cs.LG])
    Hierarchical federated learning (HFL) shows great advantages over conventional two-layer federated learning (FL) in reducing network overhead and interaction latency while still retaining the data privacy of distributed FL clients. However, the communication and energy overhead still pose a bottleneck for HFL performance, especially as the number of clients rises dramatically. To tackle this issue, we propose a non-orthogonal multiple access (NOMA) enabled HFL system under semi-synchronous cloud model aggregation in this paper, aiming to minimize the total cost of time and energy at each HFL global round. Specifically, we first propose a novel fuzzy-logic-based client orchestration policy considering client heterogeneity in multiple aspects, including channel quality, data quantity and model staleness. Subsequently, given the fuzzy-based client-edge association, a joint edge server scheduling and resource allocation problem is formulated. Utilizing problem decomposition, we first derive the closed-form solution for the edge server scheduling subproblem via the penalty dual decomposition (PDD) method. Next, a deep deterministic policy gradient (DDPG) based algorithm is proposed to tackle the resource allocation subproblem considering time-varying environments. Finally, extensive simulations demonstrate that the proposed scheme outperforms the considered benchmarks regarding HFL performance improvement and total cost reduction.
    Towards Revealing the Mystery behind Chain of Thought: A Theoretical Perspective. (arXiv:2305.15408v4 [cs.LG] UPDATED)
    Recent studies have discovered that Chain-of-Thought prompting (CoT) can dramatically improve the performance of Large Language Models (LLMs), particularly when dealing with complex tasks involving mathematics or reasoning. Despite the enormous empirical success, the underlying mechanisms behind CoT and how it unlocks the potential of LLMs remain elusive. In this paper, we take a first step towards theoretically answering these questions. Specifically, we examine the expressivity of LLMs with CoT in solving fundamental mathematical and decision-making problems. By using circuit complexity theory, we first give impossibility results showing that bounded-depth Transformers are unable to directly produce correct answers for basic arithmetic/equation tasks unless the model size grows super-polynomially with respect to the input length. In contrast, we then prove by construction that autoregressive Transformers of constant size suffice to solve both tasks by generating CoT derivations using a commonly used math language format. Moreover, we show LLMs with CoT can handle a general class of decision-making problems known as Dynamic Programming, thus justifying their power in tackling complex real-world tasks. Finally, an extensive set of experiments shows that, while Transformers always fail to directly predict the answers, they can consistently learn to generate correct solutions step-by-step given sufficient CoT demonstrations.  ( 3 min )
    Transfer learning for atomistic simulations using GNNs and kernel mean embeddings. (arXiv:2306.01589v4 [cs.LG] UPDATED)
    Interatomic potentials learned using machine learning methods have been successfully applied to atomistic simulations. However, accurate models require large training datasets, while generating reference calculations is computationally demanding. To bypass this difficulty, we propose a transfer learning algorithm that leverages the ability of graph neural networks (GNNs) to represent chemical environments together with kernel mean embeddings. We extract a feature map from GNNs pre-trained on the OC20 dataset and use it to learn the potential energy surface from system-specific datasets of catalytic processes. Our method is further enhanced by incorporating into the kernel the chemical species information, resulting in improved performance and interpretability. We test our approach on a series of realistic datasets of increasing complexity, showing excellent generalization and transferability performance, and improving on methods that rely on GNNs or ridge regression alone, as well as similar fine-tuning approaches.  ( 2 min )
    Linear Oscillation: A Novel Activation Function for Vision Transformer. (arXiv:2308.13670v3 [cs.LG] UPDATED)
    Activation functions are the linchpins of deep learning, profoundly influencing both the representational capacity and training dynamics of neural networks. They shape not only the nature of representations but also optimize convergence rates and enhance generalization potential. Appreciating this critical role, we present the Linear Oscillation (LoC) activation function, defined as $f(x) = x \times \sin(\alpha x + \beta)$. Distinct from conventional activation functions which primarily introduce non-linearity, LoC seamlessly blends linear trajectories with oscillatory deviations. The nomenclature "Linear Oscillation" is a nod to its unique attribute of infusing linear activations with harmonious oscillations, capturing the essence of the "Importance of Confusion". This concept of "controlled confusion" within network activations is posited to foster more robust learning, particularly in contexts that necessitate discerning subtle patterns. Our empirical studies reveal that, when integrated into diverse neural architectures, the LoC activation function consistently outperforms established counterparts like ReLU and Sigmoid. The stellar performance exhibited by the avant-garde Vision Transformer model using LoC further validates its efficacy. This study illuminates the remarkable benefits of the LoC over other prominent activation functions. It champions the notion that intermittently introducing deliberate complexity or "confusion" during training can spur more profound and nuanced learning. This accentuates the pivotal role of judiciously selected activation functions in shaping the future of neural network training.
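The activation defined in the abstract above, $f(x) = x \times \sin(\alpha x + \beta)$, can be written out directly. The following is a minimal sketch in plain Python; the default values `alpha=1.0` and `beta=0.0` are assumptions chosen for illustration, since the abstract does not state which values the authors use. The derivative is included because it is what a gradient-based optimizer would see.

```python
import math


def linear_oscillation(x: float, alpha: float = 1.0, beta: float = 0.0) -> float:
    """LoC activation as stated in the abstract: f(x) = x * sin(alpha * x + beta).

    The default alpha and beta are illustrative assumptions.
    """
    return x * math.sin(alpha * x + beta)


def linear_oscillation_grad(x: float, alpha: float = 1.0, beta: float = 0.0) -> float:
    """Derivative by the product rule: sin(phase) + alpha * x * cos(phase)."""
    phase = alpha * x + beta
    return math.sin(phase) + alpha * x * math.cos(phase)


if __name__ == "__main__":
    for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
        print(x, linear_oscillation(x), linear_oscillation_grad(x))
```

Unlike ReLU or Sigmoid, this function is non-monotonic and its gradient changes sign periodically, which is worth keeping in mind when choosing `alpha`.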
    WD3: Taming the Estimation Bias in Deep Reinforcement Learning. (arXiv:2006.12622v2 [cs.LG] UPDATED)
    The overestimation phenomenon caused by function approximation is a well-known issue in value-based reinforcement learning algorithms such as deep Q-networks and DDPG, which could lead to suboptimal policies. To address this issue, TD3 takes the minimum value between a pair of critics. In this paper, we show that the TD3 algorithm introduces underestimation bias under mild assumptions. To obtain a more precise estimation of the value function, we unify these two opposites and propose a novel algorithm, Weighted Delayed Deep Deterministic Policy Gradient (WD3), which can eliminate the estimation bias and further improve the performance by weighting a pair of critics. To demonstrate the effectiveness of WD3, we compare the learning process of the value function between DDPG, TD3, and WD3. The results verify that our algorithm does eliminate the estimation error of value functions. Furthermore, we evaluate our algorithm on continuous control tasks. We observe that in each test task, WD3 consistently outperforms, or at the very least matches, the state-of-the-art algorithms. Our code is available at https://sites.google.com/view/ictai20-wd3/.
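To make the idea of "weighting a pair of critics" concrete, here is a small sketch of a bootstrapped target that interpolates between the TD3-style minimum of two critic estimates and their average. The abstract does not give the exact weighting rule, so the convex combination below, the weight name `beta`, and the function name `weighted_target` are all assumptions made for illustration, not the authors' stated formula.

```python
def weighted_target(
    reward: float,
    q1_next: float,
    q2_next: float,
    gamma: float = 0.99,
    beta: float = 0.5,
    done: bool = False,
) -> float:
    """Bootstrapped target from two critic estimates of the next state-action value.

    beta = 1.0 recovers the TD3-style minimum (tends to underestimate);
    beta = 0.0 uses the plain average of the two critics.
    The convex combination is an assumed form, used here only as a sketch.
    """
    q_min = min(q1_next, q2_next)
    q_avg = 0.5 * (q1_next + q2_next)
    blended = beta * q_min + (1.0 - beta) * q_avg
    # No bootstrapping from terminal states.
    return reward + (0.0 if done else gamma * blended)


if __name__ == "__main__":
    for b in (0.0, 0.5, 1.0):
        print(b, weighted_target(1.0, 2.0, 4.0, gamma=0.5, beta=b))
```

Sweeping `beta` between 0 and 1 shows how the target moves between the pessimistic and the averaged estimate.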
    Flamingo: Multi-Round Single-Server Secure Aggregation with Applications to Private Federated Learning. (arXiv:2308.09883v2 [cs.CR] UPDATED)
    This paper introduces Flamingo, a system for secure aggregation of data across a large set of clients. In secure aggregation, a server sums up the private inputs of clients and obtains the result without learning anything about the individual inputs beyond what is implied by the final sum. Flamingo focuses on the multi-round setting found in federated learning in which many consecutive summations (averages) of model weights are performed to derive a good model. Previous protocols, such as Bell et al. (CCS '20), have been designed for a single round and are adapted to the federated learning setting by repeating the protocol multiple times. Flamingo eliminates the need for the per-round setup of previous protocols, and has a new lightweight dropout resilience protocol to ensure that if clients leave in the middle of a sum the server can still obtain a meaningful result. Furthermore, Flamingo introduces a new way to locally choose the so-called client neighborhood introduced by Bell et al. These techniques help Flamingo reduce the number of interactions between clients and the server, resulting in a significant reduction in the end-to-end runtime for a full training session over prior work. We implement and evaluate Flamingo and show that it can securely train a neural network on the (Extended) MNIST and CIFAR-100 datasets, and the model converges without a loss in accuracy, compared to a non-private federated learning system.
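The property described above, that the server learns the sum but nothing about individual inputs, is often illustrated with pairwise additive masks that cancel in the total. The sketch below shows only that textbook idea and is not Flamingo's protocol: a single seeded random generator stands in for secrets that each pair of clients would agree on privately, there is no dropout handling, and the names `MODULUS`, `masked_inputs`, and `aggregate` are hypothetical.

```python
import random

# Arithmetic is done modulo a fixed value so masked inputs look uniformly random.
MODULUS = 2**32


def masked_inputs(values, seed=0):
    """Return one masked value per client.

    For each pair (i, j) with i < j, client i adds a shared mask and client j
    subtracts the same mask, so all masks cancel when the server sums.
    A single RNG stands in for pairwise-agreed secrets (illustration only).
    """
    rng = random.Random(seed)
    masked = [v % MODULUS for v in values]
    n = len(values)
    for i in range(n):
        for j in range(i + 1, n):
            mask = rng.randrange(MODULUS)
            masked[i] = (masked[i] + mask) % MODULUS
            masked[j] = (masked[j] - mask) % MODULUS
    return masked


def aggregate(masked):
    """Server-side sum; the pairwise masks cancel modulo MODULUS."""
    return sum(masked) % MODULUS


if __name__ == "__main__":
    private = [3, 10, 25]
    print(aggregate(masked_inputs(private)) == sum(private) % MODULUS)
```

In a multi-round setting such as federated learning, this summation is repeated once per training round, which is where per-round setup cost and client dropouts become the practical concerns the abstract mentions.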
    Learn to Categorize or Categorize to Learn? Self-Coding for Generalized Category Discovery. (arXiv:2310.19776v2 [cs.CV] UPDATED)
    In the quest for unveiling novel categories at test time, we confront the inherent limitations of traditional supervised recognition models that are restricted by a predefined category set. While strides have been made in the realms of self-supervised and open-world learning towards test-time category discovery, a crucial yet often overlooked question persists: what exactly delineates a category? In this paper, we conceptualize a category through the lens of optimization, viewing it as an optimal solution to a well-defined problem. Harnessing this unique conceptualization, we propose a novel, efficient and self-supervised method capable of discovering previously unknown categories at test time. A salient feature of our approach is the assignment of minimum length category codes to individual data instances, which encapsulates the implicit category hierarchy prevalent in real-world datasets. This mechanism affords us enhanced control over category granularity, thereby equipping our model to handle fine-grained categories adeptly. Experimental evaluations, bolstered by state-of-the-art benchmark comparisons, testify to the efficacy of our solution in managing unknown categories at test time. Furthermore, we fortify our proposition with a theoretical foundation, providing proof of its optimality. Our code is available at https://github.com/SarahRastegar/InfoSieve.
    LLMs-augmented Contextual Bandit. (arXiv:2311.02268v1 [cs.LG])
    Contextual bandits have emerged as a cornerstone in reinforcement learning, enabling systems to make decisions with partial feedback. However, as contexts grow in complexity, traditional bandit algorithms can face challenges in adequately capturing and utilizing such contexts. In this paper, we propose a novel integration of large language models (LLMs) with the contextual bandit framework. By leveraging LLMs as an encoder, we enrich the representation of the context, providing the bandit with a denser and more informative view. Preliminary results on synthetic datasets demonstrate the potential of this approach, showing notable improvements in cumulative rewards and reductions in regret compared to traditional bandit algorithms. This integration not only showcases the capabilities of LLMs in reinforcement learning but also opens the door to a new era of contextually-aware decision systems.
    Design Of Rubble Analyzer Probe Using ML For Earthquake. (arXiv:2311.02087v1 [cs.SD])
    The earthquake rubble analyzer uses machine learning to detect human presence via ambient sounds, achieving 97.45% accuracy. It also provides real-time environmental data, aiding in assessing survival prospects for trapped individuals, which is crucial for post-earthquake rescue efforts.
    Resist Label Noise with PGM for Graph Neural Networks. (arXiv:2311.02116v1 [cs.LG])
    While robust graph neural networks (GNNs) have been widely studied for graph perturbation and attack, those for label noise have received significantly less attention. Most existing methods heavily rely on the label smoothness assumption to correct noisy labels, which adversely affects their performance on heterophilous graphs. Further, they generally perform poorly in high noise-rate scenarios. To address these problems, in this paper, we propose a novel probabilistic graphical model (PGM) based framework LNP. Given a noisy label set and a clean label set, our goal is to maximize the likelihood of labels in the clean set. We first present LNP-v1, which generates clean labels based on graphs only in the Bayesian network. To further leverage the information of clean labels in the noisy label set, we put forward LNP-v2, which incorporates the noisy label set into the Bayesian network to generate clean labels. The generative process can then be used to predict labels for unlabeled nodes. We conduct extensive experiments to show the robustness of LNP on varying noise types and rates, and also on graphs with different heterophilies. In particular, we show that LNP can lead to inspiring performance in high noise-rate situations.
    Sliced Denoising: A Physics-Informed Molecular Pre-Training Method. (arXiv:2311.02124v1 [q-bio.BM])
    While molecular pre-training has shown great potential in enhancing drug discovery, the lack of a solid physical interpretation in current methods raises concerns about whether the learned representation truly captures the underlying explanatory factors in observed data, ultimately resulting in limited generalization and robustness. Although denoising methods offer a physical interpretation, their accuracy is often compromised by ad-hoc noise design, leading to inaccurate learned force fields. To address this limitation, this paper proposes a new method for molecular pre-training, called sliced denoising (SliDe), which is based on the classical mechanical intramolecular potential theory. SliDe utilizes a novel noise strategy that perturbs bond lengths, angles, and torsion angles to achieve better sampling over conformations. Additionally, it introduces a random slicing approach that circumvents the computationally expensive calculation of the Jacobian matrix, which is otherwise essential for estimating the force field. By aligning with physical principles, SliDe shows a 42% improvement in the accuracy of estimated force fields compared to current state-of-the-art denoising methods, and thus outperforms traditional baselines on various molecular property prediction tasks.
    Solving MaxSAT with Matrix Multiplication. (arXiv:2311.02101v1 [cs.AI])
    We propose an incomplete algorithm for Maximum Satisfiability (MaxSAT) specifically designed to run on neural network accelerators such as GPUs and TPUs. Given a MaxSAT problem instance in conjunctive normal form, our procedure constructs a Restricted Boltzmann Machine (RBM) with an equilibrium distribution wherein the probability of a Boolean assignment is exponential in the number of clauses it satisfies. Block Gibbs sampling is used to stochastically search the space of assignments with parallel Markov chains. Since matrix multiplication is the main computational primitive for block Gibbs sampling in an RBM, our approach leads to an elegantly simple algorithm (40 lines of JAX) well-suited for neural network accelerators. Theoretical results about RBMs guarantee that the required number of visible and hidden units of the RBM scale only linearly with the number of variables and constant-sized clauses in the MaxSAT instance, ensuring that the computational cost of a Gibbs step scales reasonably with the instance size. Search throughput can be increased by batching parallel chains within a single accelerator as well as by distributing them across multiple accelerators. As a further enhancement, a heuristic based on unit propagation running on CPU is periodically applied to the sampled assignments. Our approach, which we term RbmSAT, is a new design point in the algorithm-hardware co-design space for MaxSAT. We present timed results on a subset of problem instances from the annual MaxSAT Evaluation's Incomplete Unweighted Track for the years 2018 to 2021. When allotted the same running time and CPU compute budget (but no TPUs), RbmSAT outperforms other participating solvers on problems drawn from three out of the four years' competitions. Given the same running time on a TPU cluster for which RbmSAT is uniquely designed, it outperforms all solvers on problems drawn from all four years.
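The claim that matrix multiplication is the main primitive of block Gibbs sampling in an RBM can be seen from a single sampling step: all hidden units are resampled from one matrix-vector product, then all visible units from the transposed product. The sketch below is a generic binary RBM step in plain Python rather than JAX, and it does not show how the paper builds the RBM from a CNF instance; the weights, biases, and function names are placeholders assumed for illustration.

```python
import math
import random


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def matvec(matrix, vector):
    """Plain matrix-vector product; on an accelerator this is the hot loop."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def block_gibbs_step(visible, weights, vis_bias, hid_bias, rng):
    """One block Gibbs step for a binary RBM.

    weights has shape (num_hidden, num_visible). All hidden units are sampled
    in one block given the visible units, then all visible units in one block
    given the hidden units.
    """
    hid_act = matvec(weights, visible)
    hidden = [1 if rng.random() < sigmoid(a + b) else 0
              for a, b in zip(hid_act, hid_bias)]
    weights_t = [list(col) for col in zip(*weights)]
    vis_act = matvec(weights_t, hidden)
    new_visible = [1 if rng.random() < sigmoid(a + b) else 0
                   for a, b in zip(vis_act, vis_bias)]
    return new_visible, hidden


if __name__ == "__main__":
    rng = random.Random(0)
    W = [[0.5, -0.2, 0.1], [-0.3, 0.4, 0.0]]  # placeholder weights
    v = [1, 0, 1]
    for _ in range(3):
        v, h = block_gibbs_step(v, W, [0.0, 0.0, 0.0], [0.0, 0.0], rng)
        print(v, h)
```

Running many such chains in parallel turns the matrix-vector products into matrix-matrix products, which is the batching opportunity the abstract refers to.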
    Embodied Lifelong Learning for Task and Motion Planning. (arXiv:2307.06870v2 [cs.RO] UPDATED)
    A robot deployed in a home over long stretches of time faces a true lifelong learning problem. As it seeks to provide assistance to its users, the robot should leverage any accumulated experience to improve its own knowledge and proficiency. We formalize this setting with a novel formulation of lifelong learning for task and motion planning (TAMP), which endows our learner with the compositionality of TAMP systems. Exploiting the modularity of TAMP, we develop a mixture of generative models that produces candidate continuous parameters for a planner. Whereas most existing lifelong learning approaches determine a priori how data is shared across various models, our approach learns shared and non-shared models and determines which to use online during planning based on auxiliary tasks that serve as a proxy for each model's understanding of a state. Our method exhibits substantial improvements (over time and compared to baselines) in planning success on 2D and BEHAVIOR domains.  ( 2 min )
    Efficient Symbolic Policy Learning with Differentiable Symbolic Expression. (arXiv:2311.02104v1 [cs.LG])
    Deep reinforcement learning (DRL) has led to a wide range of advances in sequential decision-making tasks. However, the complexity of neural network policies makes it difficult to understand and deploy with limited computational resources. Currently, employing compact symbolic expressions as symbolic policies is a promising strategy to obtain simple and interpretable policies. Previous symbolic policy methods usually involve complex training processes and pre-trained neural network policies, which are inefficient and limit the application of symbolic policies. In this paper, we propose an efficient gradient-based learning method named Efficient Symbolic Policy Learning (ESPL) that learns the symbolic policy from scratch in an end-to-end way. We introduce a symbolic network as the search space and employ a path selector to find the compact symbolic policy. By doing so we represent the policy with a differentiable symbolic expression and train it in an off-policy manner which further improves the efficiency. In addition, in contrast with previous symbolic policies which only work in single-task RL because of complexity, we expand ESPL on meta-RL to generate symbolic policies for unseen tasks. Experimentally, we show that our approach generates symbolic policies with higher performance and greatly improves data efficiency for single-task RL. In meta-RL, we demonstrate that compared with neural network policies the proposed symbolic policy achieves higher performance and efficiency and shows the potential to be interpretable.
    FLSL: Feature-level Self-supervised Learning. (arXiv:2306.06203v4 [cs.LG] UPDATED)
    Current self-supervised learning (SSL) methods (e.g., SimCLR, DINO, VICReg, MOCOv3) primarily target representations at the instance level and do not generalize well to dense prediction tasks, such as object detection and segmentation. Towards aligning SSL with dense predictions, this paper demonstrates for the first time the underlying mean-shift clustering process of Vision Transformers (ViT), which aligns well with natural image semantics (e.g., a world of objects and stuffs). By employing a transformer for joint embedding and clustering, we propose a two-level feature clustering SSL method, coined Feature-Level Self-supervised Learning (FLSL). We present the formal definition of the FLSL problem and construct the objectives from the mean-shift and k-means perspectives. We show that FLSL promotes remarkable semantic cluster representations and learns an embedding scheme amenable to intra-view and inter-view feature clustering. Experiments show that FLSL yields significant improvements in dense prediction tasks, achieving 44.9 (+2.8)% AP and 46.5% AP in object detection, as well as 40.8 (+2.3)% AP and 42.1% AP in instance segmentation on MS-COCO, using Mask R-CNN with ViT-S/16 and ViT-S/8 as backbone, respectively. FLSL consistently outperforms existing SSL methods across additional benchmarks, including UAV17 object detection on UAVDT, and video instance segmentation on DAVIS 2017. We conclude by presenting visualization and various ablation studies to better understand the success of FLSL. The source code is available at https://github.com/ISL-CV/FLSL.
    DiffLoad: Uncertainty Quantification in Load Forecasting with Diffusion Model. (arXiv:2306.01001v2 [cs.LG] UPDATED)
    Electrical load forecasting plays a crucial role in decision-making for power systems, including unit commitment and economic dispatch. The integration of renewable energy sources and the occurrence of external events, such as the COVID-19 pandemic, have rapidly increased uncertainties in load forecasting. The uncertainties in load forecasting can be divided into two types: epistemic uncertainty and aleatoric uncertainty. Separating these types of uncertainties can help decision-makers better understand where and to what extent the uncertainty is, thereby enhancing their confidence in the following decision-making. This paper proposes a diffusion-based Seq2Seq structure to estimate epistemic uncertainty and employs the robust additive Cauchy distribution to estimate aleatoric uncertainty. Our method not only ensures the accuracy of load forecasting but also demonstrates the ability to separate the two types of uncertainties and be applicable to different levels of loads. The relevant code can be found at https://anonymous.4open.science/r/DiffLoad-4714/.  ( 2 min )
    CAT-Walk: Inductive Hypergraph Learning via Set Walks. (arXiv:2306.11147v2 [cs.LG] UPDATED)
    Temporal hypergraphs provide a powerful paradigm for modeling time-dependent, higher-order interactions in complex systems. Representation learning for hypergraphs is essential for extracting patterns of the higher-order interactions that are critically important in real-world problems in social network analysis, neuroscience, finance, etc. However, existing methods are typically designed only for specific tasks or static hypergraphs. We present CAT-Walk, an inductive method that learns the underlying dynamic laws that govern the temporal and structural processes underlying a temporal hypergraph. CAT-Walk introduces a temporal, higher-order walk on hypergraphs, SetWalk, that extracts higher-order causal patterns. CAT-Walk uses a novel adaptive and permutation invariant pooling strategy, SetMixer, along with a set-based anonymization process that hides the identity of hyperedges. Finally, we present a simple yet effective neural network model to encode hyperedges. Our evaluation on 10 hypergraph benchmark datasets shows that CAT-Walk attains outstanding performance on temporal hyperedge prediction benchmarks in both inductive and transductive settings. It also shows competitive performance with state-of-the-art methods for node classification. (https://github.com/ubc-systopia/CATWalk)  ( 2 min )
    Benchmarking Foundation Models with Language-Model-as-an-Examiner. (arXiv:2306.04181v2 [cs.CL] UPDATED)
    Numerous benchmarks have been established to assess the performance of foundation models on open-ended question answering, which serves as a comprehensive test of a model's ability to understand and generate language in a manner similar to humans. Most of these works focus on proposing new datasets; however, we see two main issues within previous benchmarking pipelines, namely testing leakage and evaluation automation. In this paper, we propose a novel benchmarking framework, Language-Model-as-an-Examiner, where the LM serves as a knowledgeable examiner that formulates questions based on its knowledge and evaluates responses in a reference-free manner. Our framework allows for effortless extensibility as various LMs can be adopted as the examiner, and the questions can be constantly updated given more diverse trigger topics. For a more comprehensive and equitable evaluation, we devise three strategies: (1) We instruct the LM examiner to generate questions across a multitude of domains to probe for a broad acquisition, and raise follow-up questions to engage in a more in-depth assessment. (2) Upon evaluation, the examiner combines both scoring and ranking measurements, providing a reliable result as it aligns closely with human annotations. (3) We additionally propose a decentralized Peer-examination method to address the biases in a single examiner. Our data and benchmarking results are available at: this http URL  ( 2 min )
    Segment Anything Model (SAM) Enhanced Pseudo Labels for Weakly Supervised Semantic Segmentation. (arXiv:2305.05803v4 [cs.CV] UPDATED)
    Weakly supervised semantic segmentation (WSSS) aims to bypass the need for laborious pixel-level annotation by using only image-level annotation. Most existing methods rely on Class Activation Maps (CAM) to derive pixel-level pseudo-labels and use them to train a fully supervised semantic segmentation model. Although these pseudo-labels are class-aware, indicating the coarse regions for particular classes, they are not object-aware and fail to delineate accurate object boundaries. To address this, we introduce a simple yet effective method harnessing the Segment Anything Model (SAM), a class-agnostic foundation model capable of producing fine-grained instance masks of objects, parts, and subparts. We use CAM pseudo-labels as cues to select and combine SAM masks, resulting in high-quality pseudo-labels that are both class-aware and object-aware. Our approach is highly versatile and can be easily integrated into existing WSSS methods without any modification. Despite its simplicity, our approach shows consistent gain over the state-of-the-art WSSS methods on both PASCAL VOC and MS-COCO datasets.  ( 2 min )
    GAD-NR: Graph Anomaly Detection via Neighborhood Reconstruction. (arXiv:2306.01951v5 [cs.LG] UPDATED)
    Graph Anomaly Detection (GAD) is a technique used to identify abnormal nodes within graphs, finding applications in network security, fraud detection, social media spam detection, and various other domains. A common method for GAD is Graph Auto-Encoders (GAEs), which encode graph data into node representations and identify anomalies by assessing the reconstruction quality of the graphs based on these representations. However, existing GAE models are primarily optimized for direct link reconstruction, resulting in nodes connected in the graph being clustered in the latent space. As a result, they excel at detecting cluster-type structural anomalies but struggle with more complex structural anomalies that do not conform to clusters. To address this limitation, we propose a novel solution called GAD-NR, a new variant of GAE that incorporates neighborhood reconstruction for graph anomaly detection. GAD-NR aims to reconstruct the entire neighborhood of a node, encompassing the local structure, self-attributes, and neighbor attributes, based on the corresponding node representation. By comparing the neighborhood reconstruction loss between anomalous nodes and normal nodes, GAD-NR can effectively detect any anomalies. Extensive experimentation conducted on six real-world datasets validates the effectiveness of GAD-NR, showcasing significant improvements (by up to 30% in AUC) over state-of-the-art competitors. The source code for GAD-NR is openly available. Importantly, the comparative analysis reveals that the existing methods perform well only in detecting one or two types of anomalies out of the three types studied. In contrast, GAD-NR excels at detecting all three types of anomalies across the datasets, demonstrating its comprehensive anomaly detection capabilities.
    C-STS: Conditional Semantic Textual Similarity. (arXiv:2305.15093v2 [cs.CL] UPDATED)
    Semantic textual similarity (STS), a cornerstone task in NLP, measures the degree of similarity between a pair of sentences, and has broad application in fields such as information retrieval and natural language understanding. However, sentence similarity can be inherently ambiguous, depending on the specific aspect of interest. We resolve this ambiguity by proposing a novel task called Conditional STS (C-STS), which measures sentences' similarity conditioned on a feature described in natural language (hereafter, the condition). As an example, the similarity between the sentences "The NBA player shoots a three-pointer." and "A man throws a tennis ball into the air to serve." is higher for the condition "The motion of the ball" (both upward) and lower for "The size of the ball" (one large and one small). C-STS's advantages are two-fold: (1) it reduces the subjectivity and ambiguity of STS and (2) it enables fine-grained language model evaluation through diverse natural language conditions. We put several state-of-the-art models to the test, and even those performing well on STS (e.g. SimCSE, Flan-T5, and GPT-4) find C-STS challenging, all with Spearman correlation scores below 50. To encourage a more comprehensive evaluation of semantic similarity and natural language understanding, we make nearly 19K C-STS examples and code available for others to train and test their models.
    Beyond Surface Statistics: Scene Representations in a Latent Diffusion Model. (arXiv:2306.05720v2 [cs.CV] UPDATED)
    Latent diffusion models (LDMs) exhibit an impressive ability to produce realistic images, yet the inner workings of these models remain mysterious. Even when trained purely on images without explicit depth information, they typically output coherent pictures of 3D scenes. In this work, we investigate a basic interpretability question: does an LDM create and use an internal representation of simple scene geometry? Using linear probes, we find evidence that the internal activations of the LDM encode linear representations of both 3D depth data and a salient-object / background distinction. These representations appear surprisingly early in the denoising process, well before a human can easily make sense of the noisy images. Intervention experiments further indicate these representations play a causal role in image synthesis, and may be used for simple high-level editing of an LDM's output. Project page: https://yc015.github.io/scene-representation-diffusion-model/
    Continually Improving Extractive QA via Human Feedback. (arXiv:2305.12473v2 [cs.CL] UPDATED)
    We study continually improving an extractive question answering (QA) system via human user feedback. We design and deploy an iterative approach, where information-seeking users ask questions, receive model-predicted answers, and provide feedback. We conduct experiments involving thousands of user interactions under diverse setups to broaden the understanding of learning from feedback over time. Our experiments show effective improvement from user feedback of extractive QA models over time across different data regimes, including significant potential for domain adaptation.
    Unleashing the Power of Pre-trained Language Models for Offline Reinforcement Learning. (arXiv:2310.20587v2 [cs.LG] UPDATED)
    Offline reinforcement learning (RL) aims to find a near-optimal policy using pre-collected datasets. In real-world scenarios, data collection could be costly and risky; therefore, offline RL becomes particularly challenging when the in-domain data is limited. Given recent advances in Large Language Models (LLMs) and their few-shot learning prowess, this paper introduces Language Models for Motion Control (LaMo), a general framework based on Decision Transformers to effectively use pre-trained Language Models (LMs) for offline RL. Our framework highlights four crucial components: (1) initializing Decision Transformers with sequentially pre-trained LMs, (2) employing the LoRA fine-tuning method, in contrast to full-weight fine-tuning, to combine the pre-trained knowledge from LMs and in-domain knowledge effectively, (3) using the non-linear MLP transformation instead of linear projections to generate embeddings, and (4) integrating an auxiliary language prediction loss during fine-tuning to stabilize the LMs and retain their original abilities on languages. Empirical results indicate LaMo achieves state-of-the-art performance in sparse-reward tasks and closes the gap between value-based offline RL methods and decision transformers in dense-reward tasks. In particular, our method demonstrates superior performance in scenarios with limited data samples. Our project website is https://lamo2023.github.io.
    Differentially Private Federated Clustering over Non-IID Data. (arXiv:2301.00955v3 [cs.DC] CROSS LISTED)
    In this paper, we investigate the federated clustering (FedC) problem, which aims to accurately partition unlabeled data samples distributed over massive clients into finite clusters under the orchestration of a parameter server, while also considering data privacy. Though it is an NP-hard optimization problem involving real variables denoting cluster centroids and binary variables denoting the cluster membership of each data sample, we judiciously reformulate the FedC problem into a non-convex optimization problem with only one convex constraint, accordingly yielding a soft clustering solution. We then propose a novel FedC algorithm using the differential privacy (DP) technique, referred to as DP-FedC, which also accounts for partial client participation and multiple local model updating steps. Furthermore, various attributes of the proposed DP-FedC are obtained through theoretical analyses of privacy protection and convergence rate, especially for the case of non-identically and independently distributed (non-i.i.d.) data, which ideally serve as guidelines for the design of the proposed DP-FedC. Finally, experimental results on two real datasets demonstrate the efficacy of the proposed DP-FedC, its much superior performance over some state-of-the-art FedC algorithms, and its consistency with all the presented analytical results.
    Tight conditions for when the NTK approximation is valid. (arXiv:2305.13141v3 [cs.LG] UPDATED)
    We study when the neural tangent kernel (NTK) approximation is valid for training a model with the square loss. In the lazy training setting of Chizat et al. 2019, we show that rescaling the model by a factor of $\alpha = O(T)$ suffices for the NTK approximation to be valid until training time $T$. Our bound is tight and improves on the previous bound of Chizat et al. 2019, which required a larger rescaling factor of $\alpha = O(T^2)$.
    Reduce Computational Complexity for Convolutional Layers by Skipping Zeros. (arXiv:2306.15951v3 [cs.LG] UPDATED)
    Convolutional neural networks necessitate good algorithms to reduce complexity, and sufficient utilization of parallel processors for acceleration. Within convolutional layers, there are three types of operators: convolution used in forward propagation, deconvolution and dilated-convolution utilized in backward propagation. During the execution of these operators, zeros are typically added to tensors, leading to redundant calculations and unnecessary strain on hardware. To circumvent these inefficiencies, we propose the C-K-S algorithm, accompanied by efficient GPU implementations. C-K-S trims filters to exclude zero-padding. For deconvolution and dilated-convolution, C-K-S transforms sparse tensors into dense tensors, and standardizes the local computational rules to simplify the hardware control. The experimental results demonstrate that C-K-S offers good performance in terms of speed and convergence, surpassing the capabilities of PyTorch and cuDNN in certain scenarios.  ( 2 min )
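The redundancy described in the abstract above can be illustrated with a small sketch. It is not the C-K-S algorithm itself, whose details the abstract does not give. It only assumes a 1D strided deconvolution computed two ways: by first inserting zeros between inputs and then convolving densely, and by visiting only the kernel taps that line up with real inputs. All function names and the sample data are hypothetical.

```python
def upsample_with_zeros(x, stride):
    """Insert (stride - 1) zeros between consecutive inputs."""
    out = []
    for i, v in enumerate(x):
        out.append(v)
        if i < len(x) - 1:
            out.extend([0.0] * (stride - 1))
    return out


def conv1d_valid(x, w):
    """Dense 'valid' cross-correlation; also counts multiplications."""
    out, mults = [], 0
    for i in range(len(x) - len(w) + 1):
        acc = 0.0
        for k, wk in enumerate(w):
            acc += x[i + k] * wk
            mults += 1
        out.append(acc)
    return out, mults


def conv1d_skip_zeros(x, w, stride):
    """Same output as conv1d_valid(upsample_with_zeros(x, stride), w),
    but only multiplies where the upsampled signal holds a real input."""
    up_len = (len(x) - 1) * stride + 1
    out, mults = [], 0
    for i in range(up_len - len(w) + 1):
        acc = 0.0
        # Position i + k of the upsampled signal is non-zero padding-free
        # only when (i + k) is a multiple of the stride.
        first_k = (-i) % stride
        for k in range(first_k, len(w), stride):
            acc += x[(i + k) // stride] * w[k]
            mults += 1
        out.append(acc)
    return out, mults


# Hypothetical sample data.
x = [1.0, 2.0, 3.0, 4.0]
w = [1.0, 0.5, 0.25]
dense_out, dense_mults = conv1d_valid(upsample_with_zeros(x, 2), w)
sparse_out, sparse_mults = conv1d_skip_zeros(x, w, 2)
print(dense_out == sparse_out, dense_mults, sparse_mults)
```

Both paths produce the same output, while the zero-skipping path performs fewer multiplications. The saving grows with the stride, since a larger share of the upsampled signal is padding.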
    Moment Matching Denoising Gibbs Sampling. (arXiv:2305.11650v2 [stat.ML] UPDATED)
    Energy-Based Models (EBMs) offer a versatile framework for modeling complex data distributions. However, training and sampling from EBMs continue to pose significant challenges. The widely-used Denoising Score Matching (DSM) method for scalable EBM training suffers from inconsistency issues, causing the energy model to learn a 'noisy' data distribution. In this work, we propose an efficient sampling framework: (pseudo)-Gibbs sampling with moment matching, which enables effective sampling from the underlying clean model when given a 'noisy' model that has been well-trained via DSM. We explore the benefits of our approach compared to related methods and demonstrate how to scale the method to high-dimensional datasets.
    Bridging RL Theory and Practice with the Effective Horizon. (arXiv:2304.09853v2 [cs.LG] UPDATED)
    Deep reinforcement learning (RL) works impressively in some environments and fails catastrophically in others. Ideally, RL theory should be able to provide an understanding of why this is, i.e. bounds predictive of practical performance. Unfortunately, current theory does not quite have this ability. We compare standard deep RL algorithms to prior sample complexity bounds by introducing a new dataset, BRIDGE. It consists of 155 deterministic MDPs from common deep RL benchmarks, along with their corresponding tabular representations, which enables us to exactly compute instance-dependent bounds. We choose to focus on deterministic environments because they share many interesting properties of stochastic environments, but are easier to analyze. Using BRIDGE, we find that prior bounds do not correlate well with when deep RL succeeds vs. fails, but discover a surprising property that does. When actions with the highest Q-values under the random policy also have the highest Q-values under the optimal policy (i.e. when it is optimal to be greedy on the random policy's Q function), deep RL tends to succeed; when they don't, deep RL tends to fail. We generalize this property into a new complexity measure of an MDP that we call the effective horizon, which roughly corresponds to how many steps of lookahead search would be needed in that MDP in order to identify the next optimal action, when leaf nodes are evaluated with random rollouts. Using BRIDGE, we show that the effective horizon-based bounds are more closely reflective of the empirical performance of PPO and DQN than prior sample complexity bounds across four metrics. We also find that, unlike existing bounds, the effective horizon can predict the effects of using reward shaping or a pre-trained exploration policy. Our code and data are available at https://github.com/cassidylaidlaw/effective-horizon
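The property described in the abstract above (being greedy on the random policy's Q function is already optimal) can be checked directly on a small tabular, deterministic, finite-horizon MDP. The following is a toy sketch under those assumptions, not the paper's effective-horizon computation. The MDP encoding (`state -> action -> (next_state, reward)`), the function names, and both example MDPs are hypothetical.

```python
def q_values(mdp, horizon, backup):
    """Backward induction on a deterministic MDP.

    `backup` maps a state's list of Q-values to that state's value:
    the mean gives the uniformly random policy, max gives the optimal one.
    Returns q[t][state][action] for t = 0 .. horizon - 1.
    """
    value = {s: 0.0 for s in mdp}  # value after the final step
    q_by_step = []
    for _ in range(horizon):
        q = {s: {a: r + value[nxt] for a, (nxt, r) in mdp[s].items()}
             for s in mdp}
        value = {s: backup(list(q[s].values())) for s in mdp}
        q_by_step.append(q)
    q_by_step.reverse()  # index 0 is now the first timestep
    return q_by_step


def greedy_actions(q_sa):
    """Set of actions attaining the maximum Q-value."""
    best = max(q_sa.values())
    return {a for a, v in q_sa.items() if v == best}


def greedy_on_random_is_optimal(mdp, horizon):
    """True if every action that is greedy w.r.t. the random policy's Q
    is also greedy w.r.t. the optimal Q, at every state and timestep."""
    q_rand = q_values(mdp, horizon, lambda qs: sum(qs) / len(qs))
    q_opt = q_values(mdp, horizon, max)
    return all(
        greedy_actions(q_rand[t][s]) <= greedy_actions(q_opt[t][s])
        for t in range(horizon) for s in mdp
    )


# Hypothetical MDP where random rollouts point the right way.
easy_mdp = {
    "start": {"a": ("good", 0.0), "b": ("bad", 0.0)},
    "good": {"a": ("good", 1.0), "b": ("good", 1.0)},
    "bad": {"a": ("bad", 0.0), "b": ("bad", 0.0)},
}
# Hypothetical MDP where the best branch looks bad under random play.
hard_mdp = {
    "start": {"a": ("safe", 0.0), "b": ("risky", 0.0)},
    "safe": {"a": ("safe", 1.0), "b": ("safe", 1.0)},
    "risky": {"a": ("risky", 4.0), "b": ("risky", -10.0)},
}
print(greedy_on_random_is_optimal(easy_mdp, 2))
print(greedy_on_random_is_optimal(hard_mdp, 2))
```

In the second MDP the "risky" branch has the highest optimal return but a poor average return under random actions, so one step of greedy lookahead on random rollouts picks the wrong action. This is the kind of case where the abstract says deep RL tends to fail.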
    Multilevel Diffusion: Infinite Dimensional Score-Based Diffusion Models for Image Generation. (arXiv:2303.04772v3 [cs.LG] UPDATED)
    Score-based diffusion models (SBDM) have recently emerged as state-of-the-art approaches for image generation. Existing SBDMs are typically formulated in a finite-dimensional setting, where images are considered as tensors of finite size. This paper develops SBDMs in the infinite-dimensional setting, that is, we model the training data as functions supported on a rectangular domain. Besides the quest for generating images at ever higher resolution, our primary motivation is to create a well-posed infinite-dimensional learning problem so that we can discretize it consistently on multiple resolution levels. We thereby intend to obtain diffusion models that generalize across different resolution levels and improve the efficiency of the training process. We demonstrate how to overcome two shortcomings of current SBDM approaches in the infinite-dimensional setting. First, we modify the forward process to ensure that the latent distribution is well-defined in the infinite-dimensional setting using the notion of trace class operators. We derive the reverse processes for finite approximations. Second, we illustrate that approximating the score function with an operator network is beneficial for multilevel training. After deriving the convergence of the discretization and the approximation of multilevel training, we implement an infinite-dimensional SBDM approach and show the first promising results on MNIST and Fashion-MNIST, underlining our developed theory.
    Detecting Language Model Attacks with Perplexity. (arXiv:2308.14132v2 [cs.CL] UPDATED)
    A novel hack involving Large Language Models (LLMs) has emerged, leveraging adversarial suffixes to trick models into generating perilous responses. This method has garnered considerable attention from reputable media outlets such as the New York Times and Wired, thereby influencing public perception regarding the security and safety of LLMs. In this study, we advocate the utilization of perplexity as one of the means to recognize such potential attacks. The underlying concept behind these hacks revolves around appending an unusually constructed string of text to a harmful query that would otherwise be blocked. This maneuver confuses the protective mechanisms and tricks the model into generating a forbidden response. Such scenarios could result in providing detailed instructions to a malicious user for constructing explosives or orchestrating a bank heist. Our investigation demonstrates the feasibility of employing perplexity, a prevalent natural language processing metric, to detect these adversarial tactics before generating a forbidden response. By evaluating the perplexity of queries with and without such adversarial suffixes using an open-source LLM, we discovered that nearly 90 percent were above a perplexity of 1000. This contrast underscores the efficacy of perplexity for detecting this type of exploit.
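A minimal sketch of the perplexity check described in the abstract above, assuming per-token natural-log probabilities have already been obtained from some language model. The log-probability values below are made up for illustration, the function names are hypothetical, and using 1000 as a hard cutoff is an assumption based on the figure quoted in the abstract, not a threshold the authors prescribe.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp of the negative mean log-probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def looks_adversarial(token_logprobs, threshold=1000.0):
    """Flag a prompt whose perplexity exceeds the (assumed) threshold."""
    return perplexity(token_logprobs) > threshold


# Hypothetical log-probs: fluent text gets moderate values, while an
# unusual machine-generated suffix gets very low probabilities.
natural_prompt = [-2.1, -3.0, -1.4, -2.7, -3.3]
suffix = [-11.5, -12.2, -10.9, -13.0, -12.4, -11.8, -12.9, -13.5]
suffixed_prompt = natural_prompt + suffix

print(looks_adversarial(natural_prompt))
print(looks_adversarial(suffixed_prompt))
```

In practice the log-probabilities would come from scoring the prompt with an open-source LLM, as the abstract describes, and the threshold would need tuning against benign prompts to control false positives.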
    An Operator Learning Framework for Spatiotemporal Super-resolution of Scientific Simulations. (arXiv:2311.02328v1 [cs.LG])
    In numerous contexts, high-resolution solutions to partial differential equations are required to capture faithfully essential dynamics which occur at small spatiotemporal scales, but these solutions can be very difficult and slow to obtain using traditional methods due to limited computational resources. A recent direction to circumvent these computational limitations is to use machine learning techniques for super-resolution, to reconstruct high-resolution numerical solutions from low-resolution simulations which can be obtained more efficiently. The proposed approach, the Super Resolution Operator Network (SROpNet), frames super-resolution as an operator learning problem and draws inspiration from existing architectures to learn continuous representations of solutions to parametric differential equations from low-resolution approximations, which can then be evaluated at any desired location. In addition, no restrictions are imposed on the locations of (the fixed number of) spatiotemporal sensors at which the low-resolution approximations are provided, thereby enabling the consideration of a broader spectrum of problems arising in practice, for which many existing super-resolution approaches are not well-suited.
    Data-Dependent Bounds for Online Portfolio Selection Without Lipschitzness and Smoothness. (arXiv:2305.13946v2 [cs.LG] UPDATED)
    This work introduces the first small-loss and gradual-variation regret bounds for online portfolio selection, marking the first instances of data-dependent bounds for online convex optimization with non-Lipschitz, non-smooth losses. The algorithms we propose exhibit sublinear regret rates in the worst cases and achieve logarithmic regrets when the data is "easy," with per-iteration time almost linear in the number of investment alternatives. The regret bounds are derived using novel smoothness characterizations of the logarithmic loss, a local norm-based analysis of following the regularized leader (FTRL) with self-concordant regularizers, which are not necessarily barriers, and an implicit variant of optimistic FTRL with the log-barrier.  ( 2 min )
    Multi-label Classification with High-rank and High-order Label Correlations. (arXiv:2207.04197v2 [cs.LG] UPDATED)
    Exploiting label correlations is important to multi-label classification. Previous methods capture the high-order label correlations mainly by transforming the label matrix to a latent label space with low-rank matrix factorization. However, the label matrix is generally a full-rank or approximately full-rank matrix, making the low-rank factorization inappropriate. Besides, in the latent space, the label correlations will become implicit. To this end, we propose a simple yet effective method to depict the high-order label correlations explicitly, and at the same time maintain the high rank of the label matrix. Moreover, we estimate the label correlations and infer model parameters simultaneously via the local geometric structure of the input to achieve mutual enhancement. Comparative studies over twelve benchmark data sets validate the effectiveness of the proposed algorithm in multi-label classification. The exploited high-order label correlations are consistent with common sense empirically. Our code is publicly available at https://github.com/Chongjie-Si/HOMI.
    Transfer-Learning Across Datasets with Different Input Dimensions: An Algorithm and Analysis for the Linear Regression Case. (arXiv:2202.05069v4 [stat.ML] UPDATED)
    With the development of new sensors and monitoring devices, more sources of data become available to be used as inputs for machine learning models. These can on the one hand help to improve the accuracy of a model. On the other hand, combining these new inputs with historical data remains a challenge that has not yet been studied in enough detail. In this work, we propose a transfer learning algorithm that combines new and historical data with different input dimensions. This approach is easy to implement, efficient, with computational complexity equivalent to the ordinary least-squares method, and requires no hyperparameter tuning, making it straightforward to apply when the new data is limited. Different from other approaches, we provide a rigorous theoretical study of its robustness, showing that it cannot be outperformed by a baseline that utilizes only the new data. Our approach achieves state-of-the-art performance on 9 real-life datasets, outperforming the linear DSFT, another linear transfer learning algorithm, and performing comparably to non-linear DSFT.
    Data pruning and neural scaling laws: fundamental limitations of score-based algorithms. (arXiv:2302.06960v3 [stat.ML] UPDATED)
    Data pruning algorithms are commonly used to reduce the memory and computational cost of the optimization process. Recent empirical results reveal that random data pruning remains a strong baseline and outperforms most existing data pruning methods in the high compression regime, i.e., where a fraction of $30\%$ or less of the data is kept. This regime has recently attracted a lot of interest as a result of the role of data pruning in improving the so-called neural scaling laws; in [Sorscher et al.], the authors showed the need for high-quality data pruning algorithms in order to beat the sample power law. In this work, we focus on score-based data pruning algorithms and show theoretically and empirically why such algorithms fail in the high compression regime. We demonstrate "No Free Lunch" theorems for data pruning and present calibration protocols that enhance the performance of existing pruning algorithms in this high compression regime using randomization.
    VillanDiffusion: A Unified Backdoor Attack Framework for Diffusion Models. (arXiv:2306.06874v3 [cs.CR] UPDATED)
    Diffusion Models (DMs) are state-of-the-art generative models that learn a reversible corruption process from iterative noise addition and denoising. They are the backbone of many generative AI applications, such as text-to-image conditional generation. However, recent studies have shown that basic unconditional DMs (e.g., DDPM and DDIM) are vulnerable to backdoor injection, a type of output manipulation attack triggered by a maliciously embedded pattern at model input. This paper presents a unified backdoor attack framework (VillanDiffusion) to expand the current scope of backdoor analysis for DMs. Our framework covers mainstream unconditional and conditional DMs (denoising-based and score-based) and various training-free samplers for holistic evaluations. Experiments show that our unified framework facilitates the backdoor analysis of different DM configurations and provides new insights into caption-based backdoor attacks on DMs. Our code is available on GitHub: https://github.com/IBM/villandiffusion  ( 2 min )
    A Novel Site-Agnostic Multimodal Deep Learning Model to Identify Pro-Eating Disorder Content on Social Media. (arXiv:2307.06775v4 [cs.LG] UPDATED)
    Over the last decade, there has been a vast increase in eating disorder diagnoses and eating disorder-attributed deaths, reaching their zenith during the Covid-19 pandemic. This immense growth derived in part from the stressors of the pandemic but also from increased exposure to social media, which is rife with content that promotes eating disorders. This study aimed to create a multimodal deep learning model that can determine if a given social media post promotes eating disorders based on a combination of visual and textual data. A labeled dataset of Tweets was collected from Twitter, recently rebranded as X, upon which twelve deep learning models were trained and evaluated. Based on model performance, the most effective deep learning model was the multimodal fusion of the RoBERTa natural language processing model and the MaxViT image classification model, attaining accuracy and F1 scores of 95.9% and 0.959, respectively. The RoBERTa and MaxViT fusion model, deployed to classify an unlabeled dataset of posts from the social media sites Tumblr and Reddit, generated results akin to those of previous research studies that did not employ artificial intelligence-based techniques, indicating that deep learning models can develop insights congruent to those of researchers. Additionally, the model was used to conduct a time-series analysis of yet unseen Tweets from eight Twitter hashtags, uncovering that, since 2014, the relative abundance of content that promotes eating disorders has decreased drastically within those communities. Despite this reduction, by 2018, content that promotes eating disorders had either stopped declining or begun to increase again on those hashtags.  ( 3 min )
    Deep Learning with Kernels through RKHM and the Perron-Frobenius Operator. (arXiv:2305.13588v2 [stat.ML] UPDATED)
    Reproducing kernel Hilbert $C^*$-module (RKHM) is a generalization of reproducing kernel Hilbert space (RKHS) by means of $C^*$-algebra, and the Perron-Frobenius operator is a linear operator related to the composition of functions. Combining these two concepts, we present deep RKHM, a deep learning framework for kernel methods. We derive a new Rademacher generalization bound in this setting and provide a theoretical interpretation of benign overfitting by means of Perron-Frobenius operators. By virtue of $C^*$-algebra, the dependency of the bound on output dimension is milder than existing bounds. We show that $C^*$-algebra is a suitable tool for deep learning with kernels, enabling us to take advantage of the product structure of operators and to provide a clear connection with convolutional neural networks. Our theoretical analysis provides a new lens through which one can design and analyze deep kernel methods.
    Bi-directional Training for Composed Image Retrieval via Text Prompt Learning. (arXiv:2303.16604v2 [cs.CV] UPDATED)
    Composed image retrieval searches for a target image based on a multi-modal user query comprised of a reference image and modification text describing the desired changes. Existing approaches to solving this challenging task learn a mapping from the (reference image, modification text)-pair to an image embedding that is then matched against a large image corpus. One area that has not yet been explored is the reverse direction, which asks the question, what reference image when modified as described by the text would produce the given target image? In this work we propose a bi-directional training scheme that leverages such reversed queries and can be applied to existing composed image retrieval architectures with minimum changes, which improves the performance of the model. To encode the bi-directional query we prepend a learnable token to the modification text that designates the direction of the query and then finetune the parameters of the text embedding module. We make no other changes to the network architecture. Experiments on two standard datasets show that our novel approach achieves improved performance over a baseline BLIP-based model that itself already achieves competitive performance. Our code is released at https://github.com/Cuberick-Orion/Bi-Blip4CIR.
    Is RLHF More Difficult than Standard RL? (arXiv:2306.14111v2 [cs.LG] UPDATED)
    Reinforcement Learning from Human Feedback (RLHF) learns from preference signals, while standard Reinforcement Learning (RL) directly learns from reward signals. Preferences arguably contain less information than rewards, which makes preference-based RL seemingly more difficult. This paper theoretically proves that, for a wide range of preference models, we can solve preference-based RL directly using existing algorithms and techniques for reward-based RL, with small or no extra costs. Specifically, (1) for preferences that are drawn from reward-based probabilistic models, we reduce the problem to robust reward-based RL that can tolerate small errors in rewards; (2) for general arbitrary preferences where the objective is to find the von Neumann winner, we reduce the problem to multiagent reward-based RL which finds Nash equilibria for factored Markov games with a restricted set of policies. The latter case can be further reduced to adversarial MDP when preferences only depend on the final state. We instantiate all reward-based RL subroutines by concrete provable algorithms, and apply our theory to a large class of models including tabular MDPs and MDPs with generic function approximation. We further provide guarantees when K-wise comparisons are available.  ( 2 min )
    BatmanNet: Bi-branch Masked Graph Transformer Autoencoder for Molecular Representation. (arXiv:2211.13979v3 [cs.LG] UPDATED)
    Although substantial efforts have been made using graph neural networks (GNNs) for AI-driven drug discovery (AIDD), effective molecular representation learning remains an open challenge, especially in the case of insufficient labeled molecules. Recent studies suggest that big GNN models pre-trained by self-supervised learning on unlabeled datasets enable better transfer performance in downstream molecular property prediction tasks. However, the approaches in these studies require multiple complex self-supervised tasks and large-scale datasets, which are time-consuming, computationally expensive, and difficult to pre-train end-to-end. Here, we design a simple yet effective self-supervised strategy to simultaneously learn local and global information about molecules, and further propose a novel bi-branch masked graph transformer autoencoder (BatmanNet) to learn molecular representations. BatmanNet features two tailored complementary and asymmetric graph autoencoders to reconstruct the missing nodes and edges, respectively, from a masked molecular graph. With this design, BatmanNet can effectively capture the underlying structure and semantic information of molecules, thus improving the performance of molecular representation. BatmanNet achieves state-of-the-art results for multiple drug discovery tasks, including molecular properties prediction, drug-drug interaction, and drug-target interaction, on 13 benchmark datasets, demonstrating its great potential and superiority in molecular representation learning.
    Using multimodal learning and deep generative models for corporate bankruptcy prediction. (arXiv:2211.08405v4 [q-fin.RM] UPDATED)
    Textual data from financial filings, e.g., the Management's Discussion & Analysis (MDA) section in Form 10-K, has been used to improve the prediction accuracy of bankruptcy models. In practice, however, we cannot obtain the MDA section for all public companies. The two main reasons for the lack of MDA are: (i) not all companies are obliged to submit the MDA and (ii) technical problems arise when crawling and scraping the MDA section. This research introduces for the first time, to the best of our knowledge, the concept of multimodal learning in bankruptcy prediction models to solve the problem that for some companies we are unable to obtain the MDA text. We use the Conditional Multimodal Discriminative (CMMD) model to learn multimodal representations that embed information from accounting, market, and textual modalities. The CMMD model needs a sample with all data modalities for model training. At test time, the CMMD model only needs access to accounting and market modalities to generate multimodal representations, which are further used to make bankruptcy predictions. This fact makes the use of bankruptcy prediction models using textual data realistic and possible, since accounting and market data are available for all companies, unlike textual data. The empirical results in this research show that the classification performance of our proposed methodology is superior to that of a large number of traditional classifier models. We also show that our proposed methodology solves the limitation of previous bankruptcy models using textual data, as they can only make predictions for a small proportion of companies.
    Fast Kernel Methods for Generic Lipschitz Losses via $p$-Sparsified Sketches. (arXiv:2206.03827v7 [stat.ML] UPDATED)
    Kernel methods are learning algorithms that enjoy solid theoretical foundations while suffering from important computational limitations. Sketching, which consists of looking for solutions within a subspace of reduced dimension, is a well-studied approach to alleviate these computational burdens. However, statistically accurate sketches, such as the Gaussian one, usually contain few null entries, such that their application to kernel methods and their non-sparse Gram matrices remains slow in practice. In this paper, we show that sparsified Gaussian (and Rademacher) sketches still produce theoretically valid approximations while allowing for important time and space savings thanks to an efficient decomposition trick. To support our method, we derive excess risk bounds for both single and multiple output kernel problems, with generic Lipschitz losses, thereby providing new guarantees for a wide range of applications, from robust regression to multiple quantile regression. Our theoretical results are complemented with experiments showing the empirical superiority of our approach over SOTA sketching methods.
    Estimation and inference for transfer learning with high-dimensional quantile regression. (arXiv:2211.14578v3 [stat.ML] UPDATED)
    Transfer learning has become an essential technique to exploit information from the source domain to boost performance of the target task. Despite their prevalence in high-dimensional data, heterogeneity and heavy tails are insufficiently accounted for by current transfer learning approaches and thus may undermine the resulting performance. We propose a transfer learning procedure in the framework of high-dimensional quantile regression models to accommodate heterogeneity and heavy tails in the source and target domains. We establish error bounds for the transfer learning estimator based on delicately selected transferable source domains, showing that lower error bounds can be achieved for critical selection criterion and larger sample size of source tasks. We further propose valid confidence interval and hypothesis test procedures for individual components of high-dimensional quantile regression coefficients by advocating a double transfer learning estimator, which is a one-step debiased estimator for the transfer learning estimator wherein the technique of transfer learning is designed again. By adopting a data-splitting technique, we advocate a transferability detection approach that is guaranteed to circumvent negative transfer and identify transferable sources with high probability. Simulation results demonstrate that the proposed method exhibits favorable and compelling performance, and its practical utility is further illustrated by analyzing a real example.
    PRISM: Progressive Restoration for Scene Graph-based Image Manipulation. (arXiv:2311.02247v1 [cs.LG])
    Scene graphs have emerged as accurate descriptive priors for image generation and manipulation tasks; however, the complexity and diversity of the shapes and relations of objects in the data make it challenging to incorporate them into models and generate high-quality results. To address these challenges, we propose PRISM, a novel progressive multi-head image manipulation approach to improve the accuracy and quality of the manipulated regions in the scene. Our image manipulation framework is trained using an end-to-end denoising masked reconstruction proxy task, where the masked regions are progressively unmasked from the outer regions to the inner part. We take advantage of the outer part of the masked area, as it has a direct correlation with the context of the scene. Moreover, our multi-head architecture simultaneously generates detailed object-specific regions in addition to the entire image to produce higher-quality images. Our model outperforms the state-of-the-art methods in the semantic image manipulation task on the CLEVR and Visual Genome datasets. Our results demonstrate the potential of our approach for enhancing the quality and precision of scene graph-based image manipulation.
    FragXsiteDTI: Revealing Responsible Segments in Drug-Target Interaction with Transformer-Driven Interpretation. (arXiv:2311.02326v1 [cs.LG])
    Drug-Target Interaction (DTI) prediction is vital for drug discovery, yet challenges persist in achieving model interpretability and optimizing performance. We propose a novel transformer-based model, FragXsiteDTI, that aims to address these challenges in DTI prediction. Notably, FragXsiteDTI is the first DTI model to simultaneously leverage drug molecule fragments and protein pockets. Our information-rich representations for both proteins and drugs offer a detailed perspective on their interaction. Inspired by the Perceiver IO framework, our model features a learnable latent array, initially interacting with protein binding site embeddings using cross-attention and later refined through self-attention and used as a query to the drug fragments in the drug's cross-attention transformer block. This learnable query array serves as a mediator and enables seamless information translation, preserving critical nuances in drug-protein interactions. Our computational results on three benchmarking datasets demonstrate the superior predictive power of our model over several state-of-the-art models. We also show the interpretability of our model in terms of the critical components of both target proteins and drug molecules within drug-target pairs.
    Error-bounded Approximate Time Series Joins Using Compact Dictionary Representations of Time Series. (arXiv:2112.12965v2 [cs.DB] UPDATED)
    The matrix profile is an effective data mining tool that provides similarity join functionality for time series data. Users of the matrix profile can either join a time series with itself using intra-similarity join (i.e., self-join) or join a time series with another time series using inter-similarity join. By invoking either or both types of joins, the matrix profile can help users discover both conserved and anomalous structures in the data. Since the introduction of the matrix profile five years ago, multiple efforts have been made to speed up the computation with approximate joins; however, the majority of these efforts only focus on self-joins. In this work, we show that it is possible to efficiently perform approximate inter-time series similarity joins with error bounded guarantees by creating a compact "dictionary" representation of time series. Using the dictionary representation instead of the original time series, we are able to improve the throughput of an anomaly mining system by at least 20X, with essentially no decrease in accuracy. As a side effect, the dictionaries also summarize the time series in a semantically meaningful way and can provide intuitive and actionable insights. We demonstrate the utility of our dictionary-based inter-time series similarity joins on domains as diverse as medicine and transportation.
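    To make the inter-similarity join concrete, the following is a minimal sketch of the exact (brute-force) join that approximate methods aim to speed up. It is not the dictionary-based method from the abstract. The function names `znorm` and `ab_join`, the window length `m`, and the use of z-normalized Euclidean distance are assumptions chosen for illustration.

```python
import math


def znorm(window):
    """Z-normalize a window; a constant window maps to all zeros."""
    n = len(window)
    mean = sum(window) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in window) / n)
    if std == 0.0:
        return [0.0] * n
    return [(x - mean) / std for x in window]


def ab_join(a, b, m):
    """Naive inter-similarity join (hypothetical helper).

    For every length-m subsequence of `a`, return the z-normalized
    Euclidean distance to its nearest length-m subsequence in `b`.
    Cost is O(len(a) * len(b) * m), which is what motivates faster joins.
    """
    if m < 1 or m > len(a) or m > len(b):
        raise ValueError("window length m must fit inside both series")
    subs_b = [znorm(b[j:j + m]) for j in range(len(b) - m + 1)]
    profile = []
    for i in range(len(a) - m + 1):
        wa = znorm(a[i:i + m])
        profile.append(min(math.dist(wa, wb) for wb in subs_b))
    return profile


if __name__ == "__main__":
    a = [0, 1, 0, 1, 0, 5, 0, 1]   # series to be checked
    b = [0, 1, 0, 1, 0, 1]         # reference series
    print(ab_join(a, b, 3))
```

    In this sketch, small profile values mark subsequences of `a` that are conserved in `b`, and large values mark candidates for anomalies. A compact dictionary would, under the abstract's description, stand in for `b` so that fewer comparisons are needed.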
    Generative Adversarial Networks to infer velocity components in rotating turbulent flows. (arXiv:2301.07541v2 [physics.flu-dyn] UPDATED)
    Inference problems for two-dimensional snapshots of rotating turbulent flows are studied. We perform a systematic quantitative benchmark of point-wise and statistical reconstruction capabilities of the linear Extended Proper Orthogonal Decomposition (EPOD) method, a non-linear Convolutional Neural Network (CNN) and a Generative Adversarial Network (GAN). We attack the important task of inferring one velocity component out of the measurement of a second one, and two cases are studied: (I) both components lie in the plane orthogonal to the rotation axis and (II) one of the two is parallel to the rotation axis. We show that the EPOD method works well only for the former case where both components are strongly correlated, while CNN and GAN always outperform EPOD both concerning point-wise and statistical reconstructions. For case (II), when the input and output data are weakly correlated, all methods fail to reconstruct faithfully the point-wise information. In this case, only GAN is able to reconstruct the field in a statistical sense. The analysis is performed using both standard validation tools based on $L_2$ spatial distance between the prediction and the ground truth and more sophisticated multi-scale analysis using wavelet decomposition. Statistical validation is based on standard Jensen-Shannon divergence between the probability density functions, spectral properties and multi-scale flatness.
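    As an illustration of the statistical validation mentioned above, here is a small sketch of the Jensen-Shannon divergence between two discrete probability distributions. It assumes the probability density functions have already been binned into normalized histograms over the same bins; the function names are hypothetical and the natural logarithm is an arbitrary choice.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) for discrete distributions.

    Terms with p_i == 0 contribute nothing by the usual convention.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, and bounded by log(2)."""
    if len(p) != len(q):
        raise ValueError("histograms must share the same bins")
    # Mixture distribution; m_i > 0 whenever p_i > 0 or q_i > 0.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


if __name__ == "__main__":
    truth = [0.1, 0.4, 0.4, 0.1]       # made-up histogram of ground truth
    predicted = [0.2, 0.3, 0.3, 0.2]   # made-up histogram of a prediction
    print(js_divergence(truth, predicted))
```

    Identical histograms give zero, and histograms with disjoint support give the maximum value log(2), which makes the measure convenient for comparing reconstructions on a common scale.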
    Successive Model-Agnostic Meta-Learning for Few-Shot Fault Time Series Prognosis. (arXiv:2311.02300v1 [cs.LG])
    Meta learning is a promising technique for solving few-shot fault prediction problems, which have attracted the attention of many researchers in recent years. Existing meta-learning methods for time series prediction, which predominantly rely on random and similarity matching-based task partitioning, face three major limitations: (1) feature exploitation inefficiency; (2) suboptimal task data allocation; and (3) limited robustness with small samples. To overcome these limitations, we introduce a novel 'pseudo meta-task' partitioning scheme that treats a continuous time period of a time series as a meta-task, composed of multiple successive short time periods. Employing continuous time series as pseudo meta-tasks allows our method to extract more comprehensive features and relationships from the data, resulting in more accurate predictions. Moreover, we introduce a differential algorithm to enhance the robustness of our method across different datasets. Through extensive experiments on several fault and time series prediction datasets, we demonstrate that our approach substantially enhances prediction performance and generalization capability under both few-shot and general conditions.
    MFTCoder: Boosting Code LLMs with Multitask Fine-Tuning. (arXiv:2311.02303v1 [cs.LG])
    Code LLMs have emerged as a specialized research field, with remarkable studies dedicated to enhancing models' coding capabilities through fine-tuning on pre-trained models. Previous fine-tuning approaches were typically tailored to specific downstream tasks or scenarios, which meant separate fine-tuning for each task, requiring extensive training resources and posing challenges in terms of deployment and maintenance. Furthermore, these approaches failed to leverage the inherent interconnectedness among different code-related tasks. To overcome these limitations, we present a multi-task fine-tuning framework, MFTCoder, that enables simultaneous and parallel fine-tuning on multiple tasks. By incorporating various loss functions, we effectively address common challenges in multi-task learning, such as data imbalance, varying difficulty levels, and inconsistent convergence speeds. Extensive experiments have conclusively demonstrated that our multi-task fine-tuning approach outperforms both individual fine-tuning on single tasks and fine-tuning on a mixed ensemble of tasks. Moreover, MFTCoder offers efficient training capabilities, including efficient data tokenization modes and PEFT fine-tuning, resulting in significantly improved speed compared to traditional fine-tuning methods. MFTCoder seamlessly integrates with several mainstream open-source LLMs, such as CodeLLama and Qwen. Leveraging the CodeLLama foundation, our MFTCoder fine-tuned model, \textsc{CodeFuse-CodeLLama-34B}, achieves an impressive pass@1 score of 74.4\% on the HumanEval benchmark, surpassing GPT-4 performance (67\%, zero-shot). MFTCoder is open-sourced at \url{https://github.com/codefuse-ai/MFTCOder}
    Multi-scale Time-stepping of Partial Differential Equations with Transformers. (arXiv:2311.02225v1 [cs.LG])
    Developing fast surrogates for Partial Differential Equations (PDEs) will accelerate design and optimization in almost all scientific and engineering applications. Neural networks have been receiving ever-increasing attention and have demonstrated remarkable success in computational modeling of PDEs; however, their prediction accuracy is not at the level of full deployment. In this work, we utilize the transformer architecture, the backbone of numerous state-of-the-art AI models, to learn the dynamics of physical systems as the mixing of spatial patterns learned by a convolutional autoencoder. Moreover, we incorporate the idea of multi-scale hierarchical time-stepping to increase the prediction speed and decrease accumulated error over time. Our model achieves similar or better results in predicting the time-evolution of Navier-Stokes equations compared to the powerful Fourier Neural Operator (FNO) and two transformer-based neural operators, OFormer and Galerkin Transformer.
    RigLSTM: Recurrent Independent Grid LSTM for Generalizable Sequence Learning. (arXiv:2311.02123v1 [cs.LG])
    Sequential processes in the real world often consist of a combination of simple subsystems that interact with each other in certain forms. Learning such a modular structure can often improve the robustness against environmental changes. In this paper, we propose recurrent independent Grid LSTM (RigLSTM), composed of a group of independent LSTM cells that cooperate with each other, for exploiting the underlying modular structure of the target task. Our model adopts cell selection, input feature selection, hidden state selection, and soft state updating to achieve a better generalization ability on the basis of the recent Grid LSTM for the tasks where some factors differ between training and evaluation. Specifically, at each time step, only a fraction of cells are activated, and the activated cells select relevant inputs and cells to communicate with. At the end of one time step, the hidden states of the activated cells are updated by considering the relevance between the inputs and the hidden states from the last and current time steps. Extensive experiments on diversified sequential modeling tasks are conducted to show the superior generalization ability when there exist changes in the testing environment. Source code is available at \url{https://github.com/ziyuwwang/rig-lstm}.
    Using General Value Functions to Learn Domain-Backed Inventory Management Policies. (arXiv:2311.02125v1 [cs.LG])
    We consider the inventory management problem, where the goal is to balance conflicting objectives such as availability and wastage of a large range of products in a store. We propose a reinforcement learning (RL) approach that utilises General Value Functions (GVFs) to derive domain-backed inventory replenishment policies. The inventory replenishment decisions are modelled as a sequential decision making problem, which is challenging due to uncertain demand and the existence of aggregate (cross-product) constraints. In existing literature, GVFs have primarily been used for auxiliary task learning. We use this capability to train GVFs on domain-critical characteristics such as prediction of stock-out probability and wastage quantity. Using this domain expertise for more effective exploration, we train an RL agent to compute the inventory replenishment quantities for a large range of products (up to 6000 in the reported experiments), which share aggregate constraints such as the total weight/volume per delivery. Additionally, we show that the GVF predictions can be used to provide additional domain-backed insights into the decisions proposed by the RL agent. Finally, since the environment dynamics are fully transferred, the trained GVFs can be used for faster adaptation to vastly different business objectives (for example, due to the start of a promotional period or due to deployment in a new customer environment).
    Bayesian Optimization of Function Networks with Partial Evaluations. (arXiv:2311.02146v1 [stat.ML])
    Bayesian optimization is a framework for optimizing functions that are costly or time-consuming to evaluate. Recent work has considered Bayesian optimization of function networks (BOFN), where the objective function is computed via a network of functions, each taking as input the output of previous nodes in the network and additional parameters. Exploiting this network structure has been shown to yield significant performance improvements. Existing BOFN algorithms for general-purpose networks are required to evaluate the full network at each iteration. However, many real-world applications allow evaluating nodes individually. To take advantage of this opportunity, we propose a novel knowledge gradient acquisition function for BOFN that chooses which node to evaluate as well as the inputs for that node in a cost-aware fashion. This approach can dramatically reduce query costs by allowing the evaluation of part of the network at a lower cost relative to evaluating the entire network. We provide an efficient approach to optimizing our acquisition function and show it outperforms existing BOFN methods and other benchmarks across several synthetic and real-world problems. Our acquisition function is the first to enable cost-aware optimization of a broad class of function networks.
    What Knowledge Gets Distilled in Knowledge Distillation?. (arXiv:2205.16004v3 [cs.CV] UPDATED)
    Knowledge distillation aims to transfer useful information from a teacher network to a student network, with the primary goal of improving the student's performance for the task at hand. Over the years, there has been a deluge of novel techniques and use cases of knowledge distillation. Yet, despite the various improvements, there seems to be a glaring gap in the community's fundamental understanding of the process. Specifically, what is the knowledge that gets distilled in knowledge distillation? In other words, in what ways does the student become similar to the teacher? Does it start to localize objects in the same way? Does it get fooled by the same adversarial samples? Do its data invariance properties become similar? Our work presents a comprehensive study to try to answer these questions. We show that existing methods can indeed indirectly distill these properties beyond improving task performance. We further study why knowledge distillation might work this way, and show that our findings have practical implications as well.
    A Robust Backpropagation-Free Framework for Images. (arXiv:2206.01820v2 [cs.NE] UPDATED)
    While current deep learning algorithms have been successful for a wide variety of artificial intelligence (AI) tasks, including those involving structured image data, they present deep neurophysiological conceptual issues due to their reliance on the gradients that are computed by backpropagation of errors (backprop). Gradients are required to obtain synaptic weight adjustments but require knowledge of feed-forward activities in order to conduct backward propagation, a biologically implausible process. This is known as the "weight transport problem". Therefore, in this work, we present a more biologically plausible approach towards solving the weight transport problem for image data. This approach, which we name the error kernel driven activation alignment (EKDAA) algorithm, accomplishes this through the introduction of locally derived error transmission kernels and error maps. Like standard deep learning networks, EKDAA performs the standard forward process via weights and activation functions; however, its backward error computation involves adaptive error kernels that propagate local error signals through the network. The efficacy of EKDAA is demonstrated by performing visual-recognition tasks on the Fashion MNIST, CIFAR-10 and SVHN benchmarks, along with demonstrating its ability to extract visual features from natural color images. Furthermore, in order to demonstrate its non-reliance on gradient computations, results are presented for an EKDAA trained CNN that employs a non-differentiable activation function.
    Generative Artificial Intelligence in Healthcare: Ethical Considerations and Assessment Checklist. (arXiv:2311.02107v1 [cs.LG])
    The widespread use of ChatGPT and other emerging technology powered by generative artificial intelligence (AI) has drawn much attention to potential ethical issues, especially in high-stakes applications such as healthcare. However, less clear is how to resolve such issues beyond following guidelines and regulations that are still under discussion and development. On the other hand, other types of generative AI have been used to synthesize images and other types of data for research and practical purposes, which have resolved some ethical issues and exposed other ethical issues, but such technology is less often the focus of ongoing ethical discussions. Here we highlight gaps in current ethical discussions of generative AI via a systematic scoping review of relevant existing research in healthcare, and reduce the gaps by proposing an ethics checklist for comprehensive assessment and transparent documentation of ethical discussions in generative AI development. While the checklist can be readily integrated into the current peer review and publication system to enhance generative AI research, it may also be used in broader settings to disclose ethics-related considerations in generative AI-powered products (or real-life applications of such products) to help users establish reasonable trust in their capabilities.
    Using reinforcement learning to autonomously identify sources of error for agents in group missions. (arXiv:2107.09232v4 [cs.RO] UPDATED)
    When agents swarm to execute a mission, some of them frequently exhibit sudden failure, as observed from the command base. It is generally difficult to determine whether a failure is caused by actuators (hypothesis, $h_a$) or sensors (hypothesis, $h_s$) by solely relying on the communication between the command base and the agent concerned. However, by instigating collusion between the agents, the cause of failure can be identified; in other words, we expect to detect corresponding displacements for $h_a$ but not for $h_s$. In this study, we considered the question as to whether artificial intelligence can autonomously generate an action plan $\boldsymbol{g}$ to pinpoint the cause as described above. Because the expected response to $\boldsymbol{g}$ generally depends upon the adopted hypothesis [let the difference be denoted by $D(\boldsymbol{g})$], a formulation that uses $D\left(\boldsymbol{g}\right)$ to pinpoint the cause can be made. Although a $\boldsymbol{g}^*$ that maximizes $D(\boldsymbol{g})$ would be a suitable action plan for this task, such an optimization is difficult to achieve using the conventional gradient method, as $D(\boldsymbol{g})$ becomes nonzero only in rare events such as collisions with other agents, and most swarm actions $\boldsymbol{g}$ give $D(\boldsymbol{g})=0$. In other words, throughout almost the entire space of $\boldsymbol{g}$, $D(\boldsymbol{g})$ has zero gradient, and the gradient method is not applicable. To overcome this problem, we formulated an action plan using Q-table reinforcement learning. Surprisingly, the optimal action plan generated via reinforcement learning presented a human-like solution to pinpoint the problem by colliding other agents with the failed agent. Using this simple prototype, we demonstrated the potential of applying Q-table reinforcement learning methods to plan autonomous actions to pinpoint the causes of failure.
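    The abstract's point that a sparse, mostly flat objective defeats gradient methods but not Q-table learning can be illustrated with a toy sketch. The corridor environment below is a hypothetical stand-in, not the swarm setting of the paper: reward is zero everywhere except at one rare event (reaching the last cell), and all names and hyperparameters are assumptions.

```python
import random

N_STATES = 5            # corridor cells 0..4; cell 4 is the rare rewarding event
LEFT, RIGHT = 0, 1


def step(state, action):
    """Deterministic toy dynamics: move one cell; reward 1 only at the goal."""
    nxt = max(0, state - 1) if action == LEFT else state + 1
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else 0.0), done


def train_q_table(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Tabular Q-learning with epsilon-greedy exploration and random tie-breaking."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        for _ in range(200):  # step cap so an unlucky episode still ends
            tie = q[state][LEFT] == q[state][RIGHT]
            if tie or rng.random() < epsilon:
                action = rng.choice((LEFT, RIGHT))
            else:
                action = LEFT if q[state][LEFT] > q[state][RIGHT] else RIGHT
            nxt, reward, done = step(state, action)
            # Standard Q-learning target: bootstrap from the best next action.
            target = reward if done else reward + gamma * max(q[nxt])
            q[state][action] += alpha * (target - q[state][action])
            state = nxt
            if done:
                break
    return q


if __name__ == "__main__":
    for s, (left, right) in enumerate(train_q_table()):
        print(s, round(left, 3), round(right, 3))
```

    Once the rare reward has been observed a single time, the table propagates its value backwards one state per update, so the learned greedy policy heads for the rewarding event even though the reward signal is zero almost everywhere.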
    Machine learning's own Industrial Revolution. (arXiv:2311.02278v1 [cs.LG])
    Machine learning is expected to enable the next Industrial Revolution. However, lacking standardized and automated assembly networks, ML faces significant challenges to meet ever-growing enterprise demands and empower broad industries. In the Perspective, we argue that ML needs to first complete its own Industrial Revolution, elaborate on how to best achieve its goals, and discuss new opportunities to enable rapid translation from ML's innovation frontier to mass production and utilization.
    Thermal Face Image Classification using Deep Learning Techniques. (arXiv:2311.02314v1 [cs.CV])
    Thermal images have various applications in security, medical and industrial domains. This paper proposes a practical deep-learning approach for thermal image classification. Accurate and efficient classification of thermal images poses a significant challenge across various fields due to the complex image content and the scarcity of annotated datasets. This work uses convolutional neural network (CNN) architectures, specifically ResNet-50 and VGGNet-19, to extract features from thermal images. This work also applies a Kalman filter to the thermal input images for image denoising. The experimental results demonstrate the effectiveness of the proposed approach in terms of accuracy and efficiency.
    Joint Problems in Learning Multiple Dynamical Systems. (arXiv:2311.02181v1 [math.OC])
    Clustering of time series is a well-studied problem, with applications ranging from quantitative, personalized models of metabolism obtained from metabolite concentrations to state discrimination in quantum information theory. We consider a variant, where given a set of trajectories and a number of parts, we jointly partition the set of trajectories and learn linear dynamical system (LDS) models for each part, so as to minimize the maximum error across all the models. We present globally convergent methods and EM heuristics, accompanied by promising computational results.
    RDumb: A simple approach that questions our progress in continual test-time adaptation. (arXiv:2306.05401v2 [cs.LG] UPDATED)
    Test-Time Adaptation (TTA) allows pre-trained models to be updated to changing data distributions at deployment time. While early work tested these algorithms for individual fixed distribution shifts, recent work proposed and applied methods for continual adaptation over long timescales. To examine the reported progress in the field, we propose the Continually Changing Corruptions (CCC) benchmark to measure asymptotic performance of TTA techniques. We find that eventually all but one of the state-of-the-art methods collapse and perform worse than a non-adapting model, including models specifically proposed to be robust to performance collapse. In addition, we introduce a simple baseline, "RDumb", that periodically resets the model to its pretrained state. RDumb performs better than or on par with the previously proposed state-of-the-art in all considered benchmarks. Our results show that previous TTA approaches are neither effective at regularizing adaptation to avoid collapse nor able to outperform a simplistic resetting strategy.
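    The resetting idea can be sketched in a few lines. The abstract states only that the model is periodically reset to its pretrained state; the class name, the `reset_every` interval, the dictionary-based parameters, and the toy adaptation rule below are all assumptions made for illustration.

```python
import copy


class PeriodicResetAdapter:
    """Wrap any test-time adaptation step with a periodic reset (hypothetical)."""

    def __init__(self, pretrained_params, adapt_step, reset_every):
        if reset_every < 1:
            raise ValueError("reset_every must be a positive integer")
        # Keep an untouched copy of the pretrained state to restore later.
        self._pretrained = copy.deepcopy(pretrained_params)
        self.params = copy.deepcopy(pretrained_params)
        self.adapt_step = adapt_step      # callable: (params, batch) -> new params
        self.reset_every = reset_every
        self.steps = 0

    def update(self, batch):
        """Adapt on one batch, restoring the pretrained state every `reset_every` steps."""
        if self.steps > 0 and self.steps % self.reset_every == 0:
            self.params = copy.deepcopy(self._pretrained)
        self.params = self.adapt_step(self.params, batch)
        self.steps += 1
        return self.params


if __name__ == "__main__":
    # Toy adaptation rule: the bias drifts by the value of each incoming batch.
    drift = lambda params, batch: {"bias": params["bias"] + batch}
    adapter = PeriodicResetAdapter({"bias": 0.0}, drift, reset_every=3)
    print([adapter.update(1.0)["bias"] for _ in range(7)])
```

    In the toy run the accumulated drift never exceeds three steps' worth, which mirrors the intent of bounding how far adaptation can wander from the pretrained model.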
    Towards objective and systematic evaluation of bias in medical imaging AI. (arXiv:2311.02115v1 [cs.CV])
    Artificial intelligence (AI) models trained using medical images for clinical tasks often exhibit bias in the form of disparities in performance between subgroups. Since not all sources of biases in real-world medical imaging data are easily identifiable, it is challenging to comprehensively assess how those biases are encoded in models, and how capable bias mitigation methods are at ameliorating performance disparities. In this article, we introduce a novel analysis framework for systematically and objectively investigating the impact of biases in medical images on AI models. We developed and tested this framework for conducting controlled in silico trials to assess bias in medical imaging AI using a tool for generating synthetic magnetic resonance images with known disease effects and sources of bias. The feasibility is showcased by using three counterfactual bias scenarios to measure the impact of simulated bias effects on a convolutional neural network (CNN) classifier and the efficacy of three bias mitigation strategies. The analysis revealed that the simulated biases resulted in expected subgroup performance disparities when the CNN was trained on the synthetic datasets. Moreover, reweighing was identified as the most successful bias mitigation strategy for this setup, and we demonstrated how explainable AI methods can aid in investigating the manifestation of bias in the model using this framework. Developing fair AI models is a considerable challenge given that many and often unknown sources of biases can be present in medical imaging datasets. In this work, we present a novel methodology to objectively study the impact of biases and mitigation strategies on deep learning pipelines, which can support the development of clinical AI that is robust and responsible.
    Variational Autoencoders for Noise Reduction in Industrial LLRF Systems. (arXiv:2311.02096v1 [physics.acc-ph])
    Industrial particle accelerators inherently operate in much dirtier environments than typical research accelerators. This leads to an increase in noise both in the RF system and in other electronic systems. Combined with the fact that industrial accelerators are mass produced, there is less attention given to optimizing the performance of an individual system. As a result, industrial systems tend to underperform considering their hardware capabilities. With the growing demand for accelerators for medical sterilization, food irradiation, cancer treatment, and imaging, improving the signal processing of these machines will increase the margin for the deployment of these systems. Our work focuses on using machine learning techniques to reduce the noise of RF signals used for pulse-to-pulse feedback in industrial accelerators. We will review our algorithms, simulation results, and results working with measured data. We will then discuss next steps for deployment and testing on an industrial system.
    Relax: Composable Abstractions for End-to-End Dynamic Machine Learning. (arXiv:2311.02103v1 [cs.LG])
    Dynamic shape computations have become critical in modern machine learning workloads, especially in emerging large language models. The success of these models has driven demand for deploying them to a diverse set of backend environments. In this paper, we present Relax, a compiler abstraction for optimizing end-to-end dynamic machine learning workloads. Relax introduces first-class symbolic shape annotations to track dynamic shape computations globally across the program. It also introduces a cross-level abstraction that encapsulates computational graphs, loop-level tensor programs, and library calls in a single representation to enable cross-level optimizations. We build an end-to-end compilation framework using the proposed approach to optimize dynamic shape models. Experimental results on large language models show that Relax delivers performance competitive with state-of-the-art hand-optimized systems across platforms and enables deployment of emerging dynamic models to a broader set of environments, including mobile phones, embedded devices, and web browsers.
    Cooperative Network Learning for Large-Scale and Decentralized Graphs. (arXiv:2311.02117v1 [cs.LG])
    Graph research, the systematic study of interconnected data points represented as graphs, plays a vital role in capturing intricate relationships within networked systems. However, in the real world, as graphs scale up, concerns about data security among different data-owning agencies arise, hindering information sharing and, ultimately, the utilization of graph data. Therefore, establishing a mutual trust mechanism among graph agencies is crucial for unlocking the full potential of graphs. Here, we introduce a Cooperative Network Learning (CNL) framework to ensure secure graph computing for various graph tasks. Essentially, this CNL framework unifies the local and global perspectives of GNN computing with distributed data for an agency by virtually connecting all participating agencies as a global graph without a fixed central coordinator. Inter-agency computing is protected by various technologies inherent in our framework, including homomorphic encryption and secure transmission. Moreover, each agency has a fair right to design or employ various graph learning models from its local or global perspective. Thus, CNL can collaboratively train GNN models based on decentralized graphs inferred from local and global graphs. Experiments on contagion dynamics prediction and traditional graph tasks (i.e., node classification and link prediction) demonstrate that our CNL architecture outperforms state-of-the-art GNNs developed at individual sites, revealing that CNL can provide a reliable, fair, secure, privacy-preserving, and global perspective to build effective and personalized models for network applications. We hope this framework will address privacy concerns in graph-related research and integrate decentralized graph data structures to benefit the network research community in cooperation and innovation.
    The Power of Preconditioning in Overparameterized Low-Rank Matrix Sensing. (arXiv:2302.01186v3 [cs.LG] UPDATED)
    We propose $\textsf{ScaledGD($\lambda$)}$, a preconditioned gradient descent method to tackle the low-rank matrix sensing problem when the true rank is unknown, and when the matrix is possibly ill-conditioned. Using overparameterized factor representations, $\textsf{ScaledGD($\lambda$)}$ starts from a small random initialization, and proceeds by gradient descent with a specific form of damped preconditioning to combat bad curvatures induced by overparameterization and ill-conditioning. At the expense of light computational overhead incurred by preconditioners, $\textsf{ScaledGD($\lambda$)}$ is remarkably robust to ill-conditioning compared to vanilla gradient descent ($\textsf{GD}$) even with overparameterization. Specifically, we show that, under the Gaussian design, $\textsf{ScaledGD($\lambda$)}$ converges to the true low-rank matrix at a constant linear rate after a small number of iterations that scales only logarithmically with respect to the condition number and the problem dimension. This significantly improves over the convergence rate of vanilla $\textsf{GD}$ which suffers from a polynomial dependency on the condition number. Our work provides evidence on the power of preconditioning in accelerating the convergence without hurting generalization in overparameterized learning.
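    For orientation, one plausible form of a damped, preconditioned update consistent with this description is written out below. The abstract does not state the formula, so everything here is an assumption: the factor iterate $X_t$, step size $\eta$, measurement operator $\mathcal{A}$, observations $y$, and the least-squares objective $f$ are notation introduced for illustration, and the paper itself should be consulted for the exact method.

```latex
% Assumed objective: least-squares fit of the factorized matrix X X^\top
f(X) = \tfrac{1}{4} \bigl\| \mathcal{A}(X X^\top) - y \bigr\|_2^2 ,
\qquad
% Assumed update: gradient step right-multiplied by a damped preconditioner
X_{t+1} = X_t - \eta \, \nabla f(X_t) \, \bigl( X_t^\top X_t + \lambda I \bigr)^{-1} .
```

    Under these assumptions, dropping the preconditioner $(X_t^\top X_t + \lambda I)^{-1}$ recovers vanilla gradient descent on $f$, while the damping term $\lambda I$ with $\lambda > 0$ keeps the inverse well defined when $X_t^\top X_t$ is nearly singular, as can happen when the factor has more columns than the true rank.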
    Multilevel Diffusion: Infinite Dimensional Score-Based Diffusion Models for Image Generation. (arXiv:2303.04772v3 [cs.LG] UPDATED)
    Score-based diffusion models (SBDM) have recently emerged as state-of-the-art approaches for image generation. Existing SBDMs are typically formulated in a finite-dimensional setting, where images are considered as tensors of finite size. This paper develops SBDMs in the infinite-dimensional setting, that is, we model the training data as functions supported on a rectangular domain. Besides the quest for generating images at ever higher resolution, our primary motivation is to create a well-posed infinite-dimensional learning problem so that we can discretize it consistently on multiple resolution levels. We thereby intend to obtain diffusion models that generalize across different resolution levels and improve the efficiency of the training process. We demonstrate how to overcome two shortcomings of current SBDM approaches in the infinite-dimensional setting. First, we modify the forward process to ensure that the latent distribution is well-defined in the infinite-dimensional setting using the notion of trace class operators. We derive the reverse processes for finite approximations. Second, we illustrate that approximating the score function with an operator network is beneficial for multilevel training. After deriving the convergence of the discretization and the approximation of multilevel training, we implement an infinite-dimensional SBDM approach and show the first promising results on MNIST and Fashion-MNIST, underlining our developed theory.
    Online covariance estimation for stochastic gradient descent under Markovian sampling. (arXiv:2308.01481v2 [math.ST] UPDATED)
    We investigate the online overlapping batch-means covariance estimator for Stochastic Gradient Descent (SGD) under Markovian sampling. Convergence rates of order $O\big(\sqrt{d}\,n^{-1/8}(\log n)^{1/4}\big)$ and $O\big(\sqrt{d}\,n^{-1/8}\big)$ are established under state-dependent and state-independent Markovian sampling, respectively, where $d$ is the dimensionality and $n$ denotes the number of observations or SGD iterations. These rates match the best-known convergence rate for independent and identically distributed (i.i.d.) data. Our analysis overcomes significant challenges that arise due to Markovian sampling, leading to the introduction of additional error terms and complex dependencies between the blocks of the batch-means covariance estimator. Moreover, we establish the convergence rate for the first four moments of the $\ell_2$ norm of the error of SGD dynamics under state-dependent Markovian data, which holds potential interest as an independent result. Numerical illustrations provide confidence intervals for SGD in linear and logistic regression models under Markovian sampling. Additionally, our method is applied to strategic classification with logistic regression, where adversaries adaptively modify features during training to affect target class classification.
    Posterior Sampling with Delayed Feedback for Reinforcement Learning with Linear Function Approximation. (arXiv:2310.18919v2 [cs.LG] UPDATED)
    Recent studies in reinforcement learning (RL) have made significant progress by leveraging function approximation to alleviate the sample complexity hurdle for better performance. Despite the success, existing provably efficient algorithms typically rely on the accessibility of immediate feedback upon taking actions. The failure to account for the impact of delay in observations can significantly degrade the performance of real-world systems due to the regret blow-up. In this work, we tackle the challenge of delayed feedback in RL with linear function approximation by employing posterior sampling, which has been shown to empirically outperform the popular UCB algorithms in a wide range of regimes. We first introduce Delayed-PSVI, an optimistic value-based algorithm that effectively explores the value function space via noise perturbation with posterior sampling. We provide the first analysis for posterior sampling algorithms with delayed feedback in RL and show our algorithm achieves $\widetilde{O}(\sqrt{d^3H^3 T} + d^2H^2 E[\tau])$ worst-case regret in the presence of unknown stochastic delays. Here $E[\tau]$ is the expected delay. To further improve its computational efficiency and to expand its applicability in high-dimensional RL problems, we incorporate a gradient-based approximate sampling scheme via Langevin dynamics for Delayed-LPSVI, which maintains the same order-optimal regret guarantee with $\widetilde{O}(dHK)$ computational cost. Empirical evaluations are performed to demonstrate the statistical and computational efficacy of our algorithms.
    Moment Matching Denoising Gibbs Sampling. (arXiv:2305.11650v2 [stat.ML] UPDATED)
    Energy-Based Models (EBMs) offer a versatile framework for modeling complex data distributions. However, training and sampling from EBMs continue to pose significant challenges. The widely-used Denoising Score Matching (DSM) method for scalable EBM training suffers from inconsistency issues, causing the energy model to learn a `noisy' data distribution. In this work, we propose an efficient sampling framework: (pseudo)-Gibbs sampling with moment matching, which enables effective sampling from the underlying clean model when given a `noisy' model that has been well-trained via DSM. We explore the benefits of our approach compared to related methods and demonstrate how to scale the method to high-dimensional datasets.
    Differentiable Cutting-plane Layers for Mixed-integer Linear Optimization. (arXiv:2311.03350v1 [math.OC])
    We consider the problem of solving a family of parametric mixed-integer linear optimization problems where some entries in the input data change. We introduce the concept of a cutting-plane layer (CPL), i.e., a differentiable cutting-plane generator mapping the problem data and previous iterates to cutting planes. We propose a CPL implementation to generate split cuts, and by combining several CPLs, we devise a differentiable cutting-plane algorithm that exploits the repeated nature of parametric instances. In an offline phase, we train our algorithm by updating the parameters controlling the CPLs, thus altering cut generation. Once trained, our algorithm computes, with predictable execution times and a fixed number of cuts, solutions with low integrality gaps. Preliminary computational tests show that our algorithm generalizes on unseen instances and captures underlying parametric structures.
    ProtoryNet - Interpretable Text Classification Via Prototype Trajectories. (arXiv:2007.01777v5 [cs.LG] UPDATED)
    We propose a novel interpretable deep neural network for text classification, called ProtoryNet, based on a new concept of prototype trajectories. Motivated by the prototype theory in modern linguistics, ProtoryNet makes a prediction by finding the most similar prototype for each sentence in a text sequence and feeding an RNN backbone with the proximity of each sentence to the corresponding active prototype. The RNN backbone then captures the temporal pattern of the prototypes, which we refer to as prototype trajectories. Prototype trajectories enable intuitive and fine-grained interpretation of the reasoning process of the RNN model, in resemblance to how humans analyze texts. We also design a prototype pruning procedure to reduce the total number of prototypes used by the model for better interpretability. Experiments on multiple public data sets show that ProtoryNet is more accurate than the baseline prototype-based deep neural net and reduces the performance gap compared to state-of-the-art black-box models. In addition, after prototype pruning, the resulting ProtoryNet models only need less than or around 20 prototypes for all datasets, which significantly benefits interpretability. Furthermore, we report a survey result indicating that human users find ProtoryNet more intuitive and easier to understand than other prototype-based methods.
    Is RLHF More Difficult than Standard RL?. (arXiv:2306.14111v2 [cs.LG] UPDATED)
    Reinforcement learning from Human Feedback (RLHF) learns from preference signals, while standard Reinforcement Learning (RL) directly learns from reward signals. Preferences arguably contain less information than rewards, which makes preference-based RL seemingly more difficult. This paper theoretically proves that, for a wide range of preference models, we can solve preference-based RL directly using existing algorithms and techniques for reward-based RL, with small or no extra costs. Specifically, (1) for preferences that are drawn from reward-based probabilistic models, we reduce the problem to robust reward-based RL that can tolerate small errors in rewards; (2) for general arbitrary preferences where the objective is to find the von Neumann winner, we reduce the problem to multiagent reward-based RL which finds Nash equilibria for factored Markov games with a restricted set of policies. The latter case can be further reduced to adversarial MDP when preferences only depend on the final state. We instantiate all reward-based RL subroutines by concrete provable algorithms, and apply our theory to a large class of models including tabular MDPs and MDPs with generic function approximation. We further provide guarantees when K-wise comparisons are available.
    Architecture Matters: Uncovering Implicit Mechanisms in Graph Contrastive Learning. (arXiv:2311.02687v1 [cs.LG])
    Following the success of contrastive learning for visual representation learning (VCL), contrastive learning has also been adapted to the graph domain, where it yields promising performance. However, through a systematic study of various graph contrastive learning (GCL) methods, we observe some common phenomena among existing GCL methods that are quite different from the original VCL methods: 1) positive samples are not a must for GCL; 2) negative samples are not necessary for graph classification, nor for node classification when adopting specific normalization modules; 3) data augmentations have much less influence on GCL, as simple domain-agnostic augmentations (e.g., Gaussian noise) can also attain fairly good performance. By uncovering how the implicit inductive bias of GNNs works in contrastive learning, we theoretically provide insights into the above intriguing properties of GCL. Rather than directly porting existing VCL methods to GCL, we advocate for more attention toward the unique architecture of graph learning and for considering its implicit influence when designing GCL methods. Code is available at https://github.com/PKU-ML/ArchitectureMattersGCL.
    Parameter-Agnostic Optimization under Relaxed Smoothness. (arXiv:2311.03252v1 [math.OC])
    Tuning hyperparameters, such as the stepsize, presents a major challenge of training machine learning models. To address this challenge, numerous adaptive optimization algorithms have been developed that achieve near-optimal complexities, even when stepsizes are independent of problem-specific parameters, provided that the loss function is $L$-smooth. However, as the assumption is relaxed to the more realistic $(L_0, L_1)$-smoothness, all existing convergence results still necessitate tuning of the stepsize. In this study, we demonstrate that Normalized Stochastic Gradient Descent with Momentum (NSGD-M) can achieve a (nearly) rate-optimal complexity without prior knowledge of any problem parameter, though this comes at the cost of introducing an exponential term dependent on $L_1$ in the complexity. We further establish that this exponential term is inevitable to such schemes by introducing a theoretical framework of lower bounds tailored explicitly for parameter-agnostic algorithms. Interestingly, in deterministic settings, the exponential factor can be neutralized by employing Gradient Descent with a Backtracking Line Search. To the best of our knowledge, these findings represent the first parameter-agnostic convergence results under the generalized smoothness condition. Our empirical experiments further confirm our theoretical insights.
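A normalized momentum update of the kind named in this abstract can be sketched as follows. The schedules `eta = t**-0.75` and `beta = 1 - t**-0.5` are assumptions chosen so that no problem constant enters the update; they are not quoted from the abstract. The function `nsgd_m` and the noisy quadratic test problem are likewise hypothetical.

```python
import numpy as np

def nsgd_m(grad_fn, x0, steps=2000, seed=0):
    """Normalized SGD with momentum using parameter-free (assumed) schedules."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    m = np.zeros_like(x)
    for t in range(1, steps + 1):
        g = grad_fn(x, rng)
        beta = 1.0 - t ** -0.5          # momentum weight (assumed schedule)
        eta = t ** -0.75                # stepsize (assumed schedule)
        m = beta * m + (1.0 - beta) * g
        norm = np.linalg.norm(m)
        if norm > 0.0:
            # The step length is eta regardless of the gradient scale.
            x = x - eta * m / norm
    return x

def noisy_quadratic_grad(x, rng):
    # Gradient of f(x) = 0.5 * ||x||^2 plus Gaussian noise.
    return x + 0.1 * rng.standard_normal(x.shape)

x0 = np.array([5.0, -5.0])
x_final = nsgd_m(noisy_quadratic_grad, x0)
initial_loss = 0.5 * float(x0 @ x0)
final_loss = 0.5 * float(x_final @ x_final)
print(initial_loss, final_loss)
```

Because the momentum vector is normalized before the step, the update never needs a smoothness constant to set its scale; only the schedule in `t` matters.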
    Sampling via Gradient Flows in the Space of Probability Measures. (arXiv:2310.03597v2 [stat.ML] UPDATED)
    Sampling a target probability distribution with an unknown normalization constant is a fundamental challenge in computational science and engineering. Recent work shows that algorithms derived by considering gradient flows in the space of probability measures open up new avenues for algorithm development. This paper makes three contributions to this sampling approach by scrutinizing the design components of such gradient flows. Any instantiation of a gradient flow for sampling needs an energy functional and a metric to determine the flow, as well as numerical approximations of the flow to derive algorithms. Our first contribution is to show that the Kullback-Leibler divergence, as an energy functional, has the unique property (among all f-divergences) that gradient flows resulting from it do not depend on the normalization constant of the target distribution. Our second contribution is to study the choice of metric from the perspective of invariance. The Fisher-Rao metric is known as the unique choice (up to scaling) that is diffeomorphism invariant. As a computationally tractable alternative, we introduce a relaxed, affine invariance property for the metrics and gradient flows. In particular, we construct various affine invariant Wasserstein and Stein gradient flows. Affine invariant gradient flows are shown to behave more favorably than their non-affine-invariant counterparts when sampling highly anisotropic distributions, in theory and by using particle methods. Our third contribution is to study, and develop efficient algorithms based on Gaussian approximations of the gradient flows; this leads to an alternative to particle methods. We establish connections between various Gaussian approximate gradient flows, discuss their relation to gradient methods arising from parametric variational inference, and study their convergence properties both theoretically and numerically.
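The independence from the normalization constant can be checked directly in the Wasserstein case. The following is a short worked derivation, not the paper's general argument; the notation $V$ for the potential and $Z$ for the normalization constant is introduced here for illustration.

```latex
\begin{align*}
\pi(x) &= \frac{e^{-V(x)}}{Z}, \\
\mathrm{KL}(\rho \,\|\, \pi) &= \int \rho \log \rho \, dx + \int \rho V \, dx + \log Z, \\
\frac{\delta\, \mathrm{KL}}{\delta \rho} &= \log \rho + V + \text{const}, \\
\partial_t \rho_t &= \nabla \cdot \Big( \rho_t \nabla \frac{\delta\, \mathrm{KL}}{\delta \rho} \Big)
  = \Delta \rho_t + \nabla \cdot \big( \rho_t \nabla V \big).
\end{align*}
```

The constant $\log Z$ enters the energy only as an additive term, so it vanishes under the gradient and the resulting Fokker-Planck equation depends on the target only through $\nabla V$.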
    Uncertainty Quantification via Neural Posterior Principal Components. (arXiv:2309.15533v2 [cs.CV] UPDATED)
    Uncertainty quantification is crucial for the deployment of image restoration models in safety-critical domains, like autonomous driving and biological imaging. To date, methods for uncertainty visualization have mainly focused on per-pixel estimates. Yet, a heatmap of per-pixel variances is typically of little practical use, as it does not capture the strong correlations between pixels. A more natural measure of uncertainty corresponds to the variances along the principal components (PCs) of the posterior distribution. Theoretically, the PCs can be computed by applying PCA on samples generated from a conditional generative model for the input image. However, this requires generating a very large number of samples at test time, which is painfully slow with the current state-of-the-art (diffusion) models. In this work, we present a method for predicting the PCs of the posterior distribution for any input image, in a single forward pass of a neural network. Our method can either wrap around a pre-trained model that was trained to minimize the mean square error (MSE), or can be trained from scratch to output both a predicted image and the posterior PCs. We showcase our method on multiple inverse problems in imaging, including denoising, inpainting, super-resolution, and biological image-to-image translation. Our method reliably conveys instance-adaptive uncertainty directions, achieving uncertainty quantification comparable with posterior samplers while being orders of magnitude faster. Code and examples are available at https://eliasnehme.github.io/NPPC/
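The sample-based baseline mentioned in this abstract (PCA on posterior samples) can be sketched as below. This is only the slow reference procedure, not the single-forward-pass method the paper proposes. The synthetic "posterior samples" and the function name `posterior_pcs_from_samples` are assumptions for illustration.

```python
import numpy as np

def posterior_pcs_from_samples(samples, num_pcs):
    """PCA on flattened posterior samples of shape (num_samples, num_pixels)."""
    mean = samples.mean(axis=0)
    centered = samples - mean
    # Rows of vt are orthonormal principal directions, sorted by singular value.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    variances = s[:num_pcs] ** 2 / (samples.shape[0] - 1)
    return mean, vt[:num_pcs], variances

# Synthetic stand-in for samples from a conditional generative model:
# one dominant uncertainty direction plus small isotropic noise.
rng = np.random.default_rng(0)
direction = np.ones(16) / np.sqrt(16.0)
z = rng.standard_normal(500)
samples = 3.0 * z[:, None] * direction[None, :] + 0.1 * rng.standard_normal((500, 16))

mean, pcs, variances = posterior_pcs_from_samples(samples, num_pcs=2)
alignment = abs(float(pcs[0] @ direction))
print(alignment, variances)
```

Each returned variance measures the uncertainty along one correlated direction across pixels, which a per-pixel variance map cannot express.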
    Modelling Cellular Perturbations with the Sparse Additive Mechanism Shift Variational Autoencoder. (arXiv:2311.02794v1 [stat.ML])
    Generative models of observations under interventions have been a vibrant topic of interest across machine learning and the sciences in recent years. For example, in drug discovery, there is a need to model the effects of diverse interventions on cells in order to characterize unknown biological mechanisms of action. We propose the Sparse Additive Mechanism Shift Variational Autoencoder, SAMS-VAE, to combine compositionality, disentanglement, and interpretability for perturbation models. SAMS-VAE models the latent state of a perturbed sample as the sum of a local latent variable capturing sample-specific variation and sparse global variables of latent intervention effects. Crucially, SAMS-VAE sparsifies these global latent variables for individual perturbations to identify disentangled, perturbation-specific latent subspaces that are flexibly composable. We evaluate SAMS-VAE both quantitatively and qualitatively on a range of tasks using two popular single cell sequencing datasets. In order to measure perturbation-specific model-properties, we also introduce a framework for evaluation of perturbation models based on average treatment effects with links to posterior predictive checks. SAMS-VAE outperforms comparable models in terms of generalization across in-distribution and out-of-distribution tasks, including a combinatorial reasoning task under resource paucity, and yields interpretable latent structures which correlate strongly to known biological mechanisms. Our results suggest SAMS-VAE is an interesting addition to the modeling toolkit for machine learning-driven scientific discovery.
    Efficient Robust Bayesian Optimization for Arbitrary Uncertain Inputs. (arXiv:2310.20145v2 [cs.LG] UPDATED)
    Bayesian Optimization (BO) is a sample-efficient optimization algorithm widely employed across various applications. In some challenging BO tasks, input uncertainty arises due to the inevitable randomness in the optimization process, such as machining errors, execution noise, or contextual variability. This uncertainty causes the input to deviate from its intended value before evaluation, resulting in significant performance fluctuations in the final result. In this paper, we introduce a novel robust Bayesian Optimization algorithm, AIRBO, which can effectively identify a robust optimum that performs consistently well under arbitrary input uncertainty. Our method directly models the uncertain inputs of arbitrary distributions by empowering the Gaussian Process with the Maximum Mean Discrepancy (MMD) and further accelerates the posterior inference via Nystrom approximation. A rigorous theoretical regret bound is established under MMD estimation error, and extensive experiments on synthetic functions and real problems demonstrate that our approach can handle various input uncertainties and achieve state-of-the-art performance.
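The Maximum Mean Discrepancy named in this abstract compares two distributions through samples. Below is a minimal sketch of the standard biased empirical estimate of squared MMD with an RBF kernel. It does not reproduce how AIRBO embeds MMD in the Gaussian Process or its Nystrom acceleration; the helper names and the `lengthscale` value are assumptions.

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=1.0):
    """RBF kernel matrix between rows of a (n x d) and rows of b (m x d)."""
    sq_dists = (
        np.sum(a ** 2, axis=1)[:, None]
        + np.sum(b ** 2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * lengthscale ** 2))

def mmd_squared(x, y, lengthscale=1.0):
    """Biased (V-statistic) estimate of the squared MMD between two sample sets."""
    k_xx = rbf_kernel(x, x, lengthscale).mean()
    k_yy = rbf_kernel(y, y, lengthscale).mean()
    k_xy = rbf_kernel(x, y, lengthscale).mean()
    return k_xx + k_yy - 2.0 * k_xy

# Samples around two uncertain inputs: same location versus a shifted location.
rng = np.random.default_rng(0)
x = 0.1 * rng.standard_normal((200, 1))
y = 0.1 * rng.standard_normal((200, 1))
z = 1.0 + 0.1 * rng.standard_normal((200, 1))

same = mmd_squared(x, y)
shifted = mmd_squared(x, z)
print(same, shifted)
```

Since the estimate needs only samples, it applies to input distributions of any shape, which matches the "arbitrary uncertain inputs" setting described above.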
    Numerically Stable Sparse Gaussian Processes via Minimum Separation using Cover Trees. (arXiv:2210.07893v3 [stat.ML] UPDATED)
    Gaussian processes are frequently deployed as part of larger machine learning and decision-making systems, for instance in geospatial modeling, Bayesian optimization, or in latent Gaussian models. Within a system, the Gaussian process model needs to perform in a stable and reliable manner to ensure it interacts correctly with other parts of the system. In this work, we study the numerical stability of scalable sparse approximations based on inducing points. To do so, we first review numerical stability, and illustrate typical situations in which Gaussian process models can be unstable. Building on stability theory originally developed in the interpolation literature, we derive sufficient and in certain cases necessary conditions on the inducing points for the computations performed to be numerically stable. For low-dimensional tasks such as geospatial modeling, we propose an automated method for computing inducing points satisfying these conditions. This is done via a modification of the cover tree data structure, which is of independent interest. We additionally propose an alternative sparse approximation for regression with a Gaussian likelihood which trades off a small amount of performance to further improve stability. We provide illustrative examples showing the relationship between stability of calculations and predictive performance of inducing point methods on spatial tasks.
    Practical Equivariances via Relational Conditional Neural Processes. (arXiv:2306.10915v2 [stat.ML] UPDATED)
    Conditional Neural Processes (CNPs) are a class of metalearning models popular for combining the runtime efficiency of amortized inference with reliable uncertainty quantification. Many relevant machine learning tasks, such as in spatio-temporal modeling, Bayesian Optimization and continuous control, inherently contain equivariances -- for example to translation -- which the model can exploit for maximal performance. However, prior attempts to include equivariances in CNPs do not scale effectively beyond two input dimensions. In this work, we propose Relational Conditional Neural Processes (RCNPs), an effective approach to incorporate equivariances into any neural process model. Our proposed method extends the applicability and impact of equivariant neural processes to higher dimensions. We empirically demonstrate the competitive performance of RCNPs on a large array of tasks naturally containing equivariances.
    Exact Generalization Guarantees for (Regularized) Wasserstein Distributionally Robust Models. (arXiv:2305.17076v2 [cs.LG] UPDATED)
    Wasserstein distributionally robust estimators have emerged as powerful models for prediction and decision-making under uncertainty. These estimators provide attractive generalization guarantees: the robust objective obtained from the training distribution is an exact upper bound on the true risk with high probability. However, existing guarantees either suffer from the curse of dimensionality, are restricted to specific settings, or lead to spurious error terms. In this paper, we show that these generalization guarantees actually hold on general classes of models, do not suffer from the curse of dimensionality, and can even cover distribution shifts at testing. We also prove that these results carry over to the newly-introduced regularized versions of Wasserstein distributionally robust problems.
    PolyLUT: Learning Piecewise Polynomials for Ultra-Low Latency FPGA LUT-based Inference. (arXiv:2309.02334v2 [cs.LG] UPDATED)
    Field-programmable gate arrays (FPGAs) are widely used to implement deep learning inference. Standard deep neural network inference involves the computation of interleaved linear maps and nonlinear activation functions. Prior work for ultra-low latency implementations has hardcoded the combination of linear maps and nonlinear activations inside FPGA lookup tables (LUTs). Our work is motivated by the idea that the LUTs in an FPGA can be used to implement a much greater variety of functions than this. In this paper, we propose a novel approach to training neural networks for FPGA deployment using multivariate polynomials as the basic building block. Our method takes advantage of the flexibility offered by the soft logic, hiding the polynomial evaluation inside the LUTs with minimal overhead. We show that by using polynomial building blocks, we can achieve the same accuracy using considerably fewer layers of soft logic than by using linear functions, leading to significant latency and area improvements. We demonstrate the effectiveness of this approach in three tasks: network intrusion detection, jet identification at the CERN Large Hadron Collider, and handwritten digit recognition using the MNIST dataset.
    Differentiable Clustering with Perturbed Spanning Forests. (arXiv:2305.16358v3 [cs.LG] UPDATED)
    We introduce a differentiable clustering method based on stochastic perturbations of minimum-weight spanning forests. This allows us to include clustering in end-to-end trainable pipelines, with efficient gradients. We show that our method performs well even in difficult settings, such as data sets with high noise and challenging geometries. We also formulate an ad hoc loss to efficiently learn from partial clustering data using this operation. We demonstrate its performance on several data sets for supervised and semi-supervised tasks.
    Fine-Tune Language Models as Differential Equation Solvers. (arXiv:2308.05061v2 [cs.LG] UPDATED)
    In the growing domain of scientific machine learning, in-context operator learning has shown notable potential in learning operators and solving differential equations using prompted data during the inference stage, without weight updates. However, the current model's overdependence on function data may inadvertently overlook the invaluable human insight into the operator. To address this, we present a transformation of in-context operator learning into a multi-modal paradigm. In particular, we take inspiration from the recent success of large language models, and propose using "captions" to integrate human knowledge about the operator, expressed through natural language descriptions and equations. Also, we introduce a novel approach to train a language-model-like architecture, or directly fine-tune existing language models, for in-context operator learning. We beat the baseline on single-modal learning tasks and also demonstrate the effectiveness of multi-modal learning in enhancing performance and reducing function data requirements. The proposed method not only significantly improves in-context operator learning, but also creates a new path for the application of language models.
    Data pruning and neural scaling laws: fundamental limitations of score-based algorithms. (arXiv:2302.06960v3 [stat.ML] UPDATED)
    Data pruning algorithms are commonly used to reduce the memory and computational cost of the optimization process. Recent empirical results reveal that random data pruning remains a strong baseline and outperforms most existing data pruning methods in the high compression regime, i.e., where a fraction of $30\%$ or less of the data is kept. This regime has recently attracted a lot of interest as a result of the role of data pruning in improving the so-called neural scaling laws; in [Sorscher et al.], the authors showed the need for high-quality data pruning algorithms in order to beat the sample power law. In this work, we focus on score-based data pruning algorithms and show theoretically and empirically why such algorithms fail in the high compression regime. We demonstrate ``No Free Lunch'' theorems for data pruning and present calibration protocols that enhance the performance of existing pruning algorithms in this high compression regime using randomization.
    Finding Counterfactually Optimal Action Sequences in Continuous State Spaces. (arXiv:2306.03929v2 [cs.LG] UPDATED)
    Whenever a clinician reflects on the efficacy of a sequence of treatment decisions for a patient, they may try to identify critical time steps where, had they made different decisions, the patient's health would have improved. While recent methods at the intersection of causal inference and reinforcement learning promise to help human experts, such as the clinician above, retrospectively analyze sequential decision-making processes, they have focused on environments with finitely many discrete states. However, in many practical applications, the state of the environment is inherently continuous in nature. In this paper, we aim to fill this gap. We start by formally characterizing a sequence of discrete actions and continuous states using finite horizon Markov decision processes and a broad class of bijective structural causal models. Building upon this characterization, we formalize the problem of finding counterfactually optimal action sequences and show that, in general, we cannot expect to solve it in polynomial time. Then, we develop a search method based on the $A^*$ algorithm that, under a natural form of Lipschitz continuity of the environment's dynamics, is guaranteed to return the optimal solution to the problem. Experiments on real clinical data show that our method is very efficient in practice, and it has the potential to offer interesting insights for sequential decision-making tasks.
    Bridging RL Theory and Practice with the Effective Horizon. (arXiv:2304.09853v2 [cs.LG] UPDATED)
    Deep reinforcement learning (RL) works impressively in some environments and fails catastrophically in others. Ideally, RL theory should be able to provide an understanding of why this is, i.e. bounds predictive of practical performance. Unfortunately, current theory does not quite have this ability. We compare standard deep RL algorithms to prior sample complexity bounds by introducing a new dataset, BRIDGE. It consists of 155 deterministic MDPs from common deep RL benchmarks, along with their corresponding tabular representations, which enables us to exactly compute instance-dependent bounds. We choose to focus on deterministic environments because they share many interesting properties of stochastic environments, but are easier to analyze. Using BRIDGE, we find that prior bounds do not correlate well with when deep RL succeeds vs. fails, but discover a surprising property that does. When actions with the highest Q-values under the random policy also have the highest Q-values under the optimal policy (i.e. when it is optimal to be greedy on the random policy's Q function), deep RL tends to succeed; when they don't, deep RL tends to fail. We generalize this property into a new complexity measure of an MDP that we call the effective horizon, which roughly corresponds to how many steps of lookahead search would be needed in that MDP in order to identify the next optimal action, when leaf nodes are evaluated with random rollouts. Using BRIDGE, we show that the effective horizon-based bounds are more closely reflective of the empirical performance of PPO and DQN than prior sample complexity bounds across four metrics. We also find that, unlike existing bounds, the effective horizon can predict the effects of using reward shaping or a pre-trained exploration policy. Our code and data are available at https://github.com/cassidylaidlaw/effective-horizon
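The property highlighted in this abstract (being greedy on the random policy's Q function is already optimal) can be checked exactly in a small deterministic tabular MDP. The sketch below is a simplified illustration, not the BRIDGE tooling or the full effective-horizon computation. It checks every state at every time step, which is stricter than checking only reachable states, and the two toy MDPs and the function name are hypothetical.

```python
import numpy as np

def greedy_on_random_q_is_optimal(next_state, reward, horizon, tol=1e-9):
    """Check whether every maximizer of Q^random is also a maximizer of Q^optimal.

    next_state[s][a] and reward[s][a] define a deterministic finite-horizon MDP.
    """
    next_state = np.asarray(next_state)
    reward = np.asarray(reward, dtype=float)
    v_rand = np.zeros(reward.shape[0])   # value of the uniformly random policy
    v_opt = np.zeros(reward.shape[0])    # value of the optimal policy
    for _ in range(horizon):             # backward induction from the last step
        q_rand = reward + v_rand[next_state]
        q_opt = reward + v_opt[next_state]
        rand_best = q_rand >= q_rand.max(axis=1, keepdims=True) - tol
        opt_best = q_opt >= q_opt.max(axis=1, keepdims=True) - tol
        if np.any(rand_best & ~opt_best):
            return False
        v_rand = q_rand.mean(axis=1)
        v_opt = q_opt.max(axis=1)
    return True

# Toy MDP 1: the rewarding state is easy to exploit even for random rollouts.
easy = greedy_on_random_q_is_optimal(
    next_state=[[0, 1], [1, 0]], reward=[[0.0, 0.0], [1.0, 0.0]], horizon=3
)
# Toy MDP 2: the best branch pays off only if the follow-up action is chosen well,
# so random rollouts undervalue it relative to a safe branch.
trap = greedy_on_random_q_is_optimal(
    next_state=[[1, 2], [1, 1], [2, 2]],
    reward=[[0.0, 0.0], [1.0, 0.0], [0.6, 0.6]],
    horizon=2,
)
print(easy, trap)
```

In the second MDP the random policy values the first branch at 0.5 and the safe branch at 0.6, while the optimal values are 1.0 and 0.6, so acting greedily on the random policy's Q function picks the wrong action.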
    On existence, uniqueness and scalability of adversarial robustness measures for AI classifiers. (arXiv:2310.14421v2 [stat.ML] UPDATED)
    Simply-verifiable mathematical conditions for existence, uniqueness and explicit analytical computation of minimal adversarial paths (MAP) and minimal adversarial distances (MAD) for (locally) uniquely-invertible classifiers, for generalized linear models (GLM), and for entropic AI (EAI) are formulated and proven. Practical computation of MAP and MAD, their comparison and interpretations for various classes of AI tools (for neuronal networks, boosted random forests, GLM and EAI) are demonstrated on the common synthetic benchmarks: on a double Swiss roll spiral and its extensions, as well as on the two biomedical data problems (for the health insurance claim predictions, and for the heart attack lethality classification). On biomedical applications it is demonstrated how MAP provides unique minimal patient-specific risk-mitigating interventions in the predefined subsets of accessible control variables.
    Almost Equivariance via Lie Algebra Convolutions. (arXiv:2310.13164v1 [cs.LG] CROSS LISTED)
    Recently, the equivariance of models with respect to a group action has become an important topic of research in machine learning. However, imbuing an architecture with a specific group equivariance imposes a strong prior on the types of data transformations that the model expects to see. While strictly-equivariant models enforce symmetries, real-world data does not always conform to such strict equivariances, be it due to noise in the data or underlying physical laws that encode only approximate or partial symmetries. In such cases, the prior of strict equivariance can actually prove too strong and cause models to underperform on real-world data. Therefore, in this work we study a closely related topic, that of almost equivariance. We provide a definition of almost equivariance that differs from those extant in the current literature and give a practical method for encoding almost equivariance in models by appealing to the Lie algebra of a Lie group. Specifically, we define Lie algebra convolutions and demonstrate that they offer several benefits over Lie group convolutions, including being well-defined for non-compact groups. From there, we pivot to the realm of theory and demonstrate connections between the notions of equivariance and isometry and those of almost equivariance and almost isometry, respectively. We prove two existence theorems, one showing the existence of almost isometries within bounded distance of isometries of a general manifold, and another showing the converse for Hilbert spaces. We then extend these theorems to prove the existence of almost equivariant manifold embeddings within bounded distance of fully equivariant embedding functions, subject to certain constraints on the group action and the function class. Finally, we demonstrate the validity of our approach by benchmarking against datasets in fully equivariant and almost equivariant settings.
    Estimation and inference for transfer learning with high-dimensional quantile regression. (arXiv:2211.14578v3 [stat.ML] UPDATED)
    Transfer learning has become an essential technique to exploit information from the source domain to boost performance of the target task. Despite their prevalence in high-dimensional data, heterogeneity and heavy tails are insufficiently accounted for by current transfer learning approaches and thus may undermine the resulting performance. We propose a transfer learning procedure in the framework of high-dimensional quantile regression models to accommodate heterogeneity and heavy tails in the source and target domains. We establish error bounds for the transfer learning estimator based on carefully selected transferable source domains, showing that lower error bounds can be achieved under a critical selection criterion and with larger sample sizes of source tasks. We further propose valid confidence intervals and hypothesis testing procedures for individual components of the high-dimensional quantile regression coefficients by advocating a double transfer learning estimator, which is a one-step debiased estimator for the transfer learning estimator wherein the technique of transfer learning is applied again. By adopting a data-splitting technique, we advocate a transferability detection approach that is guaranteed to circumvent negative transfer and identify transferable sources with high probability. Simulation results demonstrate that the proposed method performs favorably, and its practical utility is further illustrated by analyzing a real example.
    Detecting hidden confounding in observational data using multiple environments. (arXiv:2205.13935v4 [stat.ME] UPDATED)
    A common assumption in causal inference from observational data is that there is no hidden confounding. Yet it is, in general, impossible to verify this assumption from a single dataset. Under the assumption of independent causal mechanisms underlying the data-generating process, we demonstrate a way to detect unobserved confounders when having multiple observational datasets coming from different environments. We present a theory for testable conditional independencies that are only absent when there is hidden confounding and examine cases where we violate its assumptions: degenerate & dependent mechanisms, and faithfulness violations. Additionally, we propose a procedure to test these independencies and study its empirical finite-sample behavior using simulation studies and semi-synthetic data based on a real-world dataset. In most cases, the proposed procedure correctly predicts the presence of hidden confounding, particularly when the confounding bias is large.
    For SALE: State-Action Representation Learning for Deep Reinforcement Learning. (arXiv:2306.02451v2 [cs.LG] UPDATED)
    In the field of reinforcement learning (RL), representation learning is a proven tool for complex image-based tasks, but is often overlooked for environments with low-level states, such as physical control problems. This paper introduces SALE, a novel approach for learning embeddings that model the nuanced interaction between state and action, enabling effective representation learning from low-level states. We extensively study the design space of these embeddings and highlight important design considerations. We integrate SALE and an adaptation of checkpoints for RL into TD3 to form the TD7 algorithm, which significantly outperforms existing continuous control algorithms. On OpenAI gym benchmark tasks, TD7 has an average performance gain of 276.7% and 50.7% over TD3 at 300k and 5M time steps, respectively, and works in both the online and offline settings.
    Learning Hard-Constrained Models with One Sample. (arXiv:2311.03332v1 [cs.LG])
    We consider the problem of estimating the parameters of a Markov Random Field with hard-constraints using a single sample. As our main running examples, we use the $k$-SAT and the proper coloring models, as well as general $H$-coloring models; for all of these we obtain both positive and negative results. In contrast to the soft-constrained case, we show in particular that single-sample estimation is not always possible, and that the existence of an estimator is related to the existence of non-satisfiable instances. Our algorithms are based on the pseudo-likelihood estimator. We show variance bounds for this estimator using coupling techniques inspired, in the case of $k$-SAT, by Moitra's sampling algorithm (JACM, 2019); our positive results for colorings build on this new coupling approach. For $q$-colorings on graphs with maximum degree $d$, we give a linear-time estimator when $q>d+1$, whereas the problem is non-identifiable when $q\leq d+1$. For general $H$-colorings, we show that standard conditions that guarantee sampling, such as Dobrushin's condition, are insufficient for one-sample learning; on the positive side, we provide a general condition that is sufficient to guarantee linear-time learning and obtain applications for proper colorings and permissive models. For the $k$-SAT model on formulas with maximum degree $d$, we provide a linear-time estimator when $k\gtrsim 6.45\log d$, whereas the problem becomes non-identifiable when $k\lesssim \log d$.
    Are you using test log-likelihood correctly?. (arXiv:2212.00219v3 [stat.ML] UPDATED)
    Test log-likelihood is commonly used to compare different models of the same data or different approximate inference algorithms for fitting the same probabilistic model. We present simple examples demonstrating how comparisons based on test log-likelihood can contradict comparisons according to other objectives. Specifically, our examples show that (i) approximate Bayesian inference algorithms that attain higher test log-likelihoods need not also yield more accurate posterior approximations and (ii) conclusions about forecast accuracy based on test log-likelihood comparisons may not agree with conclusions based on root mean squared error.
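    As a rough illustration of point (ii), the sketch below builds a toy case in which test log-likelihood and root mean squared error rank two Gaussian predictive models differently. This is not one of the paper's examples; the two models, their parameters, and the helper names are assumptions chosen for illustration.

```python
import numpy as np


def gaussian_test_log_likelihood(y, mean, std):
    """Average log density of the test points under N(mean, std^2)."""
    var = std ** 2
    return float(np.mean(-0.5 * np.log(2.0 * np.pi * var) - 0.5 * (y - mean) ** 2 / var))


def rmse(y, mean):
    """Root mean squared error of the point prediction `mean`."""
    return float(np.sqrt(np.mean((y - mean) ** 2)))


rng = np.random.default_rng(0)
y_test = rng.normal(loc=0.0, scale=1.0, size=10_000)  # true model: N(0, 1)

# Model A: correct mean, but an overconfident predictive std of 0.3.
ll_a = gaussian_test_log_likelihood(y_test, 0.0, 0.3)
rmse_a = rmse(y_test, 0.0)

# Model B: biased mean of 0.3, but the correct predictive std of 1.0.
ll_b = gaussian_test_log_likelihood(y_test, 0.3, 1.0)
rmse_b = rmse(y_test, 0.3)
```

    Here model B attains the higher test log-likelihood while model A attains the lower RMSE, so the two criteria disagree about which model is better.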
    Benign Overfitting for Two-layer ReLU Convolutional Neural Networks. (arXiv:2303.04145v2 [cs.LG] UPDATED)
    Modern deep learning models with great expressive power can be trained to overfit the training data but still generalize well. This phenomenon is referred to as benign overfitting. Recently, a few studies have attempted to theoretically understand benign overfitting in neural networks. However, these works are either limited to neural networks with smooth activation functions or to the neural tangent kernel regime. How and when benign overfitting can occur in ReLU neural networks remains an open problem. In this work, we seek to answer this question by establishing algorithm-dependent risk bounds for learning two-layer ReLU convolutional neural networks with label-flipping noise. We show that, under mild conditions, the neural network trained by gradient descent can achieve near-zero training loss and Bayes optimal test risk. Our result also reveals a sharp transition between benign and harmful overfitting under different conditions on data distribution in terms of test risk. Experiments on synthetic data back up our theory.
    Nearly Minimax Optimal Reinforcement Learning for Linear Markov Decision Processes. (arXiv:2212.06132v3 [cs.LG] UPDATED)
    We study reinforcement learning (RL) with linear function approximation. For episodic time-inhomogeneous linear Markov decision processes (linear MDPs) whose transition probability can be parameterized as a linear function of a given feature mapping, we propose the first computationally efficient algorithm that achieves the nearly minimax optimal regret $\tilde O(d\sqrt{H^3K})$, where $d$ is the dimension of the feature mapping, $H$ is the planning horizon, and $K$ is the number of episodes. Our algorithm is based on a weighted linear regression scheme with a carefully designed weight, which depends on a new variance estimator that (1) directly estimates the variance of the optimal value function, (2) monotonically decreases with respect to the number of episodes to ensure a better estimation accuracy, and (3) uses a rare-switching policy to update the value function estimator to control the complexity of the estimated value function class. Our work provides a complete answer to optimal RL with linear MDPs, and the developed algorithm and theoretical tools may be of independent interest.
    Fast Kernel Methods for Generic Lipschitz Losses via $p$-Sparsified Sketches. (arXiv:2206.03827v7 [stat.ML] UPDATED)
    Kernel methods are learning algorithms that enjoy solid theoretical foundations while suffering from important computational limitations. Sketching, which consists of looking for solutions within a subspace of reduced dimension, is a well-studied approach to alleviate these computational burdens. However, statistically accurate sketches, such as the Gaussian one, usually contain few null entries, such that their application to kernel methods and their non-sparse Gram matrices remains slow in practice. In this paper, we show that sparsified Gaussian (and Rademacher) sketches still produce theoretically valid approximations while allowing for important time and space savings thanks to an efficient decomposition trick. To support our method, we derive excess risk bounds for both single and multiple output kernel problems, with generic Lipschitz losses, thereby providing new guarantees for a wide range of applications, from robust regression to multiple quantile regression. Our theoretical results are complemented with experiments showing the empirical superiority of our approach over SOTA sketching methods.
    A Novel Framework for Policy Mirror Descent with General Parameterization and Linear Convergence. (arXiv:2301.13139v3 [stat.ML] UPDATED)
    Modern policy optimization methods in reinforcement learning, such as TRPO and PPO, owe their success to the use of parameterized policies. However, while theoretical guarantees have been established for this class of algorithms, especially in the tabular setting, the use of general parameterization schemes remains mostly unjustified. In this work, we introduce a novel framework for policy optimization based on mirror descent that naturally accommodates general parameterizations. The policy class induced by our scheme recovers known classes, e.g., softmax, and generates new ones depending on the choice of mirror map. Using our framework, we obtain the first result that guarantees linear convergence for a policy-gradient-based method involving general parameterization. To demonstrate the ability of our framework to accommodate general parameterization schemes, we provide its sample complexity when using shallow neural networks, show that it represents an improvement upon the previous best results, and empirically validate the effectiveness of our theoretical claims on classic control tasks.
    Transfer-Learning Across Datasets with Different Input Dimensions: An Algorithm and Analysis for the Linear Regression Case. (arXiv:2202.05069v4 [stat.ML] UPDATED)
    With the development of new sensors and monitoring devices, more sources of data become available to be used as inputs for machine learning models. These can on the one hand help to improve the accuracy of a model. On the other hand, combining these new inputs with historical data remains a challenge that has not yet been studied in enough detail. In this work, we propose a transfer learning algorithm that combines new and historical data with different input dimensions. This approach is easy to implement, efficient, with computational complexity equivalent to the ordinary least-squares method, and requires no hyperparameter tuning, making it straightforward to apply when the new data is limited. Different from other approaches, we provide a rigorous theoretical study of its robustness, showing that it cannot be outperformed by a baseline that utilizes only the new data. Our approach achieves state-of-the-art performance on 9 real-life datasets, outperforming the linear DSFT, another linear transfer learning algorithm, and performing comparably to non-linear DSFT.
    DiffLoad: Uncertainty Quantification in Load Forecasting with Diffusion Model. (arXiv:2306.01001v2 [cs.LG] UPDATED)
    Electrical load forecasting plays a crucial role in decision-making for power systems, including unit commitment and economic dispatch. The integration of renewable energy sources and the occurrence of external events, such as the COVID-19 pandemic, have rapidly increased uncertainties in load forecasting. The uncertainties in load forecasting can be divided into two types: epistemic uncertainty and aleatoric uncertainty. Separating these types of uncertainties can help decision-makers better understand where and to what extent the uncertainty is, thereby enhancing their confidence in the following decision-making. This paper proposes a diffusion-based Seq2Seq structure to estimate epistemic uncertainty and employs the robust additive Cauchy distribution to estimate aleatoric uncertainty. Our method not only ensures the accuracy of load forecasting but also demonstrates the ability to separate the two types of uncertainties and be applicable to different levels of loads. The relevant code can be found at https://anonymous.4open.science/r/DiffLoad-4714/.
    Flooding with Absorption: An Efficient Protocol for Heterogeneous Bandits over Complex Networks. (arXiv:2303.05445v3 [cs.LG] UPDATED)
    Multi-armed bandits are extensively used to model sequential decision-making, making them ubiquitous in many real-life applications such as online recommender systems and wireless networking. We consider a multi-agent setting where each agent solves their own bandit instance endowed with a different set of arms. Their goal is to minimize their group regret while collaborating via some communication protocol over a given network. Previous literature on this problem only considered arm heterogeneity and networked agents separately. In this work, we introduce a setting that encompasses both features. For this novel setting, we first provide a rigorous regret analysis for a standard flooding protocol combined with the classic UCB policy. Then, to mitigate the issue of high communication costs incurred by flooding in complex networks, we propose a new protocol called Flooding with Absorption (FwA). We provide a theoretical analysis of the resulting regret bound and discuss the advantages of using FwA over flooding. Lastly, we experimentally verify on various scenarios, including dynamic networks, that FwA leads to significantly lower communication costs despite minimal regret performance loss compared to other network protocols.
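    For context on the "classic UCB policy" mentioned above, here is a minimal single-agent UCB1 sketch on a Bernoulli bandit. It does not implement flooding, FwA, or the multi-agent heterogeneous setting of the paper; the function name `ucb1`, the arm means, and the horizon are illustrative assumptions.

```python
import math
import random


def ucb1(arm_means, horizon, seed=0):
    """Run UCB1 on a Bernoulli bandit and return the pull count of each arm."""
    rng = random.Random(seed)
    n_arms = len(arm_means)
    counts = [0] * n_arms    # number of pulls per arm
    totals = [0.0] * n_arms  # cumulative reward per arm

    for t in range(1, horizon + 1):
        if t <= n_arms:
            # Initialization: pull every arm once.
            arm = t - 1
        else:
            # Pick the arm with the largest empirical mean plus exploration bonus.
            arm = max(
                range(n_arms),
                key=lambda a: totals[a] / counts[a]
                + math.sqrt(2.0 * math.log(t) / counts[a]),
            )
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        totals[arm] += reward
    return counts


pull_counts = ucb1([0.2, 0.8], horizon=2000)
```

    With a large gap between the two arm means, almost all pulls go to the better arm, since the number of suboptimal pulls under UCB1 grows only logarithmically with the horizon.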
    Identifying Linearly-Mixed Causal Representations from Multi-Node Interventions. (arXiv:2311.02695v1 [stat.ML])
    The task of inferring high-level causal variables from low-level observations, commonly referred to as causal representation learning, is fundamentally underconstrained. As such, recent works to address this problem focus on various assumptions that lead to identifiability of the underlying latent causal variables. A large corpus of these preceding approaches consider multi-environment data collected under different interventions on the causal model. What is common to virtually all of these works is the restrictive assumption that in each environment, only a single variable is intervened on. In this work, we relax this assumption and provide the first identifiability result for causal representation learning that allows for multiple variables to be targeted by an intervention within one environment. Our approach hinges on a general assumption on the coverage and diversity of interventions across environments, which also includes the shared assumption of single-node interventions of previous works. The main idea behind our approach is to exploit the trace that interventions leave on the variance of the ground truth causal variables and regularizing for a specific notion of sparsity with respect to this trace. In addition to and inspired by our theoretical contributions, we present a practical algorithm to learn causal representations from multi-node interventional data and provide empirical evidence that validates our identifiability results.
    Deep Learning with Kernels through RKHM and the Perron-Frobenius Operator. (arXiv:2305.13588v2 [stat.ML] UPDATED)
    Reproducing kernel Hilbert $C^*$-module (RKHM) is a generalization of reproducing kernel Hilbert space (RKHS) by means of $C^*$-algebra, and the Perron-Frobenius operator is a linear operator related to the composition of functions. Combining these two concepts, we present deep RKHM, a deep learning framework for kernel methods. We derive a new Rademacher generalization bound in this setting and provide a theoretical interpretation of benign overfitting by means of Perron-Frobenius operators. By virtue of $C^*$-algebra, the dependency of the bound on output dimension is milder than existing bounds. We show that $C^*$-algebra is a suitable tool for deep learning with kernels, enabling us to take advantage of the product structure of operators and to provide a clear connection with convolutional neural networks. Our theoretical analysis provides a new lens through which one can design and analyze deep kernel methods.
    Using multimodal learning and deep generative models for corporate bankruptcy prediction. (arXiv:2211.08405v4 [q-fin.RM] UPDATED)
    Textual data from financial filings, e.g., the Management's Discussion & Analysis (MDA) section in Form 10-K, has been used to improve the prediction accuracy of bankruptcy models. In practice, however, we cannot obtain the MDA section for all public companies. The two main reasons for the lack of MDA are: (i) not all companies are obliged to submit the MDA and (ii) technical problems arise when crawling and scraping the MDA section. This research introduces for the first time, to the best of our knowledge, the concept of multimodal learning in bankruptcy prediction models to solve the problem that for some companies we are unable to obtain the MDA text. We use the Conditional Multimodal Discriminative (CMMD) model to learn multimodal representations that embed information from accounting, market, and textual modalities. The CMMD model needs a sample with all data modalities for model training. At test time, the CMMD model only needs access to accounting and market modalities to generate multimodal representations, which are further used to make bankruptcy predictions. This fact makes the use of bankruptcy prediction models using textual data realistic and possible, since accounting and market data are available for all companies unlike textual data. The empirical results in this research show that the classification performance of our proposed methodology is superior compared to that of a large number of traditional classifier models. We also show that our proposed methodology solves the limitation of previous bankruptcy models using textual data, as they can only make predictions for a small proportion of companies.
    Robust Meta-Representation Learning via Global Label Inference and Classification. (arXiv:2212.11702v2 [cs.LG] UPDATED)
    Few-shot learning (FSL) is a central problem in meta-learning, where learners must efficiently learn from few labeled examples. Within FSL, feature pre-training has recently become an increasingly popular strategy to significantly improve generalization performance. However, the contribution of pre-training is often overlooked and understudied, with limited theoretical understanding of its impact on meta-learning performance. Further, pre-training requires a consistent set of global labels shared across training tasks, which may be unavailable in practice. In this work, we address the above issues by first showing the connection between pre-training and meta-learning. We discuss why pre-training yields more robust meta-representation and connect the theoretical analysis to existing works and empirical results. Secondly, we introduce Meta Label Learning (MeLa), a novel meta-learning algorithm that learns task relations by inferring global labels across tasks. This allows us to exploit pre-training for FSL even when global labels are unavailable or ill-defined. Lastly, we introduce an augmented pre-training procedure that further improves the learned meta-representation. Empirically, MeLa outperforms existing methods across a diverse range of benchmarks, in particular under a more challenging setting where the number of training tasks is limited and labels are task-specific. We also provide extensive ablation study to highlight its key properties.
    Training Matters: Unlocking Potentials of Deeper Graph Convolutional Neural Networks. (arXiv:2008.08838v3 [cs.LG] UPDATED)
    The performance limit of Graph Convolutional Networks (GCNs) and the fact that we cannot stack more of them to increase the performance, which we usually do for other deep learning paradigms, are pervasively thought to be caused by the limitations of the GCN layers, including insufficient expressive power, etc. However, if so, for a fixed architecture, it would be unlikely to lower the training difficulty and to improve performance by changing only the training procedure, which we show in this paper to be not only possible but possible in several ways. This paper first identifies the training difficulty of GCNs from the perspective of graph signal energy loss. More specifically, we find that the loss of energy in the backward pass during training nullifies the learning of the layers closer to the input. Then, we propose several methodologies to mitigate the training problem by slightly modifying the GCN operator, from the energy perspective. After empirical validation, we confirm that these changes of operator lead to a significant decrease in the training difficulties and a notable performance boost, without changing the composition of parameters. With these, we conclude that the root cause of the problem is more likely the training difficulty than the others.
    A Contrastive Approach to Online Change Point Detection. (arXiv:2206.10143v3 [stat.ML] UPDATED)
    We suggest a novel procedure for online change point detection. Our approach expands an idea of maximizing a discrepancy measure between points from pre-change and post-change distributions. This leads to a flexible procedure suitable for both parametric and nonparametric scenarios. We prove non-asymptotic bounds on the average running length of the procedure and its expected detection delay. The efficiency of the algorithm is illustrated with numerical experiments on synthetic and real-world data sets.
    A New Bandit Setting Balancing Information from State Evolution and Corrupted Context. (arXiv:2011.07989v4 [cs.LG] UPDATED)
    We propose a new sequential decision-making setting, combining key aspects of two established online learning problems with bandit feedback. The optimal action to play at any given moment is contingent on an underlying changing state which is not directly observable by the agent. Each state is associated with a context distribution, possibly corrupted, allowing the agent to identify the state. Furthermore, states evolve in a Markovian fashion, providing useful information to estimate the current state via state history. In the proposed problem setting, we tackle the challenge of deciding on which of the two sources of information the agent should base its arm selection. We present an algorithm that uses a referee to dynamically combine the policies of a contextual bandit and a multi-armed bandit. We capture the time-correlation of states through iteratively learning the action-reward transition model, allowing for efficient exploration of actions. Our setting is motivated by adaptive mobile health (mHealth) interventions. Users transition through different, time-correlated, but only partially observable internal states, determining their current needs. The side information associated with each internal state might not always be reliable, and standard approaches that rely solely on the context risk incurring high regret. Similarly, some users might exhibit weaker correlations between subsequent states, so approaches that rely solely on state transitions run the same risk. We analyze our setting and algorithm in terms of regret lower bound and upper bounds and evaluate our method on simulated medication adherence intervention data and several real-world data sets, showing improved empirical performance compared to several popular algorithms.
    Independent finite approximations for Bayesian nonparametric inference. (arXiv:2009.10780v4 [stat.ME] UPDATED)
    Completely random measures (CRMs) and their normalizations (NCRMs) offer flexible models in Bayesian nonparametrics. But their infinite dimensionality presents challenges for inference. Two popular finite approximations are truncated finite approximations (TFAs) and independent finite approximations (IFAs). While the former have been well-studied, IFAs lack similarly general bounds on approximation error, and there has been no systematic comparison between the two options. In the present work, we propose a general recipe to construct practical finite-dimensional approximations for homogeneous CRMs and NCRMs, in the presence or absence of power laws. We call our construction the automated independent finite approximation (AIFA). Relative to TFAs, we show that AIFAs facilitate more straightforward derivations and use of parallel computing in approximate inference. We upper bound the approximation error of AIFAs for a wide class of common CRMs and NCRMs -- and thereby develop guidelines for choosing the approximation level. Our lower bounds in key cases suggest that our upper bounds are tight. We prove that, for worst-case choices of observation likelihoods, TFAs are more efficient than AIFAs. Conversely, we find that in real-data experiments with standard likelihoods, AIFAs and TFAs perform similarly. Moreover, we demonstrate that AIFAs can be used for hyperparameter estimation even when other potential IFA options struggle or do not apply.
    Forward $\chi^2$ Divergence Based Variational Importance Sampling. (arXiv:2311.02516v1 [cs.LG])
    Maximizing the log-likelihood is a crucial aspect of learning latent variable models, and variational inference (VI) stands as the commonly adopted method. However, VI can encounter challenges in achieving a high log-likelihood when dealing with complicated posterior distributions. In response to this limitation, we introduce a novel variational importance sampling (VIS) approach that directly estimates and maximizes the log-likelihood. VIS leverages the optimal proposal distribution, achieved by minimizing the forward $\chi^2$ divergence, to enhance log-likelihood estimation. We apply VIS to various popular latent variable models, including mixture models, variational auto-encoders, and partially observable generalized linear models. Results demonstrate that our approach consistently outperforms state-of-the-art baselines, both in terms of log-likelihood and model parameter estimation.
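    To make the role of the proposal distribution concrete, the sketch below estimates a log-likelihood by plain importance sampling in a toy conjugate Gaussian model where the exact answer is known. It is not the VIS algorithm and does not optimize any divergence; the model (z ~ N(0, 1), x | z ~ N(z, 1)), the function names, and the sample sizes are assumptions for illustration.

```python
import numpy as np


def log_normal_pdf(value, mean, var):
    """Log density of N(mean, var) evaluated at value."""
    return -0.5 * np.log(2.0 * np.pi * var) - 0.5 * (value - mean) ** 2 / var


def importance_sampled_log_likelihood(x, proposal_mean, proposal_var, n_samples, rng):
    """Estimate log p(x) with samples z ~ q(z) = N(proposal_mean, proposal_var)."""
    z = proposal_mean + np.sqrt(proposal_var) * rng.standard_normal(n_samples)
    log_joint = log_normal_pdf(z, 0.0, 1.0) + log_normal_pdf(x, z, 1.0)
    log_weights = log_joint - log_normal_pdf(z, proposal_mean, proposal_var)
    # Log-mean-exp of the importance weights, shifted for numerical stability.
    shift = log_weights.max()
    return float(shift + np.log(np.mean(np.exp(log_weights - shift))))


rng = np.random.default_rng(0)
x_obs = 1.5

# In this toy model the marginal is x ~ N(0, 2) and the posterior is N(x/2, 1/2).
exact = float(log_normal_pdf(x_obs, 0.0, 2.0))
estimate_posterior = importance_sampled_log_likelihood(x_obs, x_obs / 2.0, 0.5, 1000, rng)
estimate_prior = importance_sampled_log_likelihood(x_obs, 0.0, 1.0, 1000, rng)
```

    When the proposal equals the exact posterior, every importance weight equals p(x), so the estimate matches the exact value up to floating-point error. The prior proposal gives a noisier estimate of the same quantity, which illustrates why the choice of proposal matters.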
    Steady-State Analysis of Queues with Hawkes Arrival and Its Application to Online Learning for Hawkes Queues. (arXiv:2311.02577v1 [math.PR])
    We investigate the long-run behavior of single-server queues with Hawkes arrivals and general service distributions and related optimization problems. In detail, utilizing novel coupling techniques, we establish finite moment bounds for the stationary distribution of the workload and busy period processes. In addition, we show that these queueing processes converge exponentially fast to their stationary distribution. Based on these theoretical results, we develop an efficient numerical algorithm to solve the optimal staffing problem for the Hawkes queues in a data-driven manner. Numerical results indicate a sharp difference in staffing for Hawkes queues, compared to the classic GI/GI/1 model, especially in the heavy-traffic regime.
    Approximating Langevin Monte Carlo with ResNet-like Neural Network architectures. (arXiv:2311.03242v1 [cs.LG])
    We sample from a given target distribution by constructing a neural network which maps samples from a simple reference, e.g. the standard normal distribution, to samples from the target. To that end, we propose using a neural network architecture inspired by the Langevin Monte Carlo (LMC) algorithm. Based on LMC perturbation results, we show approximation rates of the proposed architecture for smooth, log-concave target distributions measured in the Wasserstein-$2$ distance. The analysis heavily relies on the notion of sub-Gaussianity of the intermediate measures of the perturbed LMC process. In particular, we derive bounds on the growth of the intermediate variance proxies under different assumptions on the perturbations. Moreover, we propose an architecture similar to deep residual neural networks and derive expressivity results for approximating the sample to target distribution map.
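    For reference, the Langevin Monte Carlo iteration that inspires the architecture can be sketched as follows. This is the plain unadjusted Langevin algorithm on a standard normal target, not the paper's neural network construction; the function names, step size, and number of steps are illustrative assumptions.

```python
import numpy as np


def langevin_monte_carlo(grad_log_density, x0, step_size, n_steps, rng):
    """Unadjusted Langevin iteration: x <- x + h * grad log pi(x) + sqrt(2h) * noise."""
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        noise = rng.standard_normal(x.shape)
        x = x + step_size * grad_log_density(x) + np.sqrt(2.0 * step_size) * noise
    return x


def standard_normal_score(x):
    """Gradient of the log density of N(0, I), a smooth log-concave target."""
    return -x


rng = np.random.default_rng(0)
# Run 5000 independent chains in parallel, all started far from the mode.
start = np.full((5000, 1), 3.0)
samples = langevin_monte_carlo(standard_normal_score, start, step_size=0.05, n_steps=500, rng=rng)
```

    Each step has the residual form "input plus update", which is the resemblance to ResNet blocks that the abstract alludes to. Because the step size is finite, the chain's stationary variance is slightly above 1 for this target, which is one source of the discretization error that such analyses must control.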
    Data-Dependent Bounds for Online Portfolio Selection Without Lipschitzness and Smoothness. (arXiv:2305.13946v2 [cs.LG] UPDATED)
    This work introduces the first small-loss and gradual-variation regret bounds for online portfolio selection, marking the first instances of data-dependent bounds for online convex optimization with non-Lipschitz, non-smooth losses. The algorithms we propose exhibit sublinear regret rates in the worst cases and achieve logarithmic regrets when the data is "easy," with per-iteration time almost linear in the number of investment alternatives. The regret bounds are derived using novel smoothness characterizations of the logarithmic loss, a local norm-based analysis of following the regularized leader (FTRL) with self-concordant regularizers, which are not necessarily barriers, and an implicit variant of optimistic FTRL with the log-barrier.
    Barron Space for Graph Convolution Neural Networks. (arXiv:2311.02838v1 [stat.ML])
    Graph convolutional neural networks (GCNNs) operate on the graph domain and have achieved superior performance on a wide range of tasks. In this paper, we introduce a Barron space of functions on a compact domain of graph signals. We prove that the proposed Barron space is a reproducing kernel Banach space, that it can be decomposed into the union of a family of reproducing kernel Hilbert spaces with neuron kernels, and that it could be dense in the space of continuous functions on the domain. The approximation property is one of the main principles in designing neural networks. In this paper, we show that outputs of GCNNs are contained in the Barron space and functions in the Barron space can be well approximated by outputs of some GCNNs in the integrated square and uniform measurements. We also estimate the Rademacher complexity of functions with bounded Barron norm and conclude that functions in the Barron space could be learnt from their random samples efficiently.
    An $\varepsilon$-Best-Arm Identification Algorithm for Fixed-Confidence and Beyond. (arXiv:2305.16041v2 [stat.ML] UPDATED)
    We propose EB-TC$\varepsilon$, a novel sampling rule for $\varepsilon$-best arm identification in stochastic bandits. It is the first Top Two algorithm analyzed for approximate best arm identification. EB-TC$\varepsilon$ is an *anytime* sampling rule that can therefore be employed without modification for fixed confidence or fixed budget identification (without prior knowledge of the budget). We provide three types of theoretical guarantees for EB-TC$\varepsilon$. First, we prove bounds on its expected sample complexity in the fixed confidence setting, notably showing its asymptotic optimality in combination with an adaptive tuning of its exploration parameter. We complement these findings with upper bounds on its probability of error at any time and for any error parameter, which further yield upper bounds on its simple regret at any time. Finally, we show through numerical simulations that EB-TC$\varepsilon$ performs favorably compared to existing algorithms, in different settings.
    Low Tensor Rank Learning of Neural Dynamics. (arXiv:2308.11567v2 [q-bio.NC] UPDATED)
    Learning relies on coordinated synaptic changes in recurrently connected populations of neurons. Therefore, understanding the collective evolution of synaptic connectivity over learning is a key challenge in neuroscience and machine learning. In particular, recent work has shown that the weight matrices of task-trained RNNs are typically low rank, but how this low rank structure unfolds over learning is unknown. To address this, we investigate the rank of the 3-tensor formed by the weight matrices throughout learning. By fitting RNNs of varying rank to large-scale neural recordings during a motor learning task, we find that the inferred weights are low-tensor-rank and therefore evolve over a fixed low-dimensional subspace throughout the entire course of learning. We next validate the observation of low-tensor-rank learning on an RNN trained to solve the same task. Finally, we present a set of mathematical results bounding the matrix and tensor ranks of gradient descent learning dynamics which show that low-tensor-rank weights emerge naturally in RNNs trained to solve low-dimensional tasks. Taken together, our findings provide insight on the evolution of population connectivity over learning in both biological and artificial neural networks, and enable reverse engineering of learning-induced changes in recurrent dynamics from large-scale neural recordings.
    Neural Structure Learning with Stochastic Differential Equations. (arXiv:2311.03309v1 [cs.LG])
    Discovering the underlying relationships among variables from temporal observations has been a longstanding challenge in numerous scientific disciplines, including biology, finance, and climate science. The dynamics of such systems are often best described using continuous-time stochastic processes. Unfortunately, most existing structure learning approaches assume that the underlying process evolves in discrete-time and/or observations occur at regular time intervals. These mismatched assumptions can often lead to incorrect learned structures and models. In this work, we introduce a novel structure learning method, SCOTCH, which combines neural stochastic differential equations (SDE) with variational inference to infer a posterior distribution over possible structures. This continuous-time approach can naturally handle both learning from and predicting observations at arbitrary time points. Theoretically, we establish sufficient conditions for an SDE and SCOTCH to be structurally identifiable, and prove its consistency under infinite data limits. Empirically, we demonstrate that our approach leads to improved structure learning performance on both synthetic and real-world datasets compared to relevant baselines under regular and irregular sampling intervals.  ( 2 min )
    Towards Revealing the Mystery behind Chain of Thought: A Theoretical Perspective. (arXiv:2305.15408v4 [cs.LG] UPDATED)
    Recent studies have discovered that Chain-of-Thought prompting (CoT) can dramatically improve the performance of Large Language Models (LLMs), particularly when dealing with complex tasks involving mathematics or reasoning. Despite the enormous empirical success, the underlying mechanisms behind CoT and how it unlocks the potential of LLMs remain elusive. In this paper, we take a first step towards theoretically answering these questions. Specifically, we examine the expressivity of LLMs with CoT in solving fundamental mathematical and decision-making problems. By using circuit complexity theory, we first give impossibility results showing that bounded-depth Transformers are unable to directly produce correct answers for basic arithmetic/equation tasks unless the model size grows super-polynomially with respect to the input length. In contrast, we then prove by construction that autoregressive Transformers of constant size suffice to solve both tasks by generating CoT derivations using a commonly used math language format. Moreover, we show LLMs with CoT can handle a general class of decision-making problems known as Dynamic Programming, thus justifying its power in tackling complex real-world tasks. Finally, an extensive set of experiments show that, while Transformers always fail to directly predict the answers, they can consistently learn to generate correct solutions step-by-step given sufficient CoT demonstrations.
    New Insights into Graph Convolutional Networks using Neural Tangent Kernels. (arXiv:2110.04060v2 [cs.LG] UPDATED)
    Graph Convolutional Networks (GCNs) have emerged as powerful tools for learning on network structured data. Although empirically successful, GCNs exhibit certain behaviour that has no rigorous explanation -- for instance, the performance of GCNs significantly degrades with increasing network depth, whereas it improves marginally with depth using skip connections. This paper focuses on semi-supervised learning on graphs, and explains the above observations through the lens of Neural Tangent Kernels (NTKs). We derive NTKs corresponding to infinitely wide GCNs (with and without skip connections). Subsequently, we use the derived NTKs to identify that, with suitable normalisation, network depth does not always drastically reduce the performance of GCNs -- a fact that we also validate through extensive simulation. Furthermore, we propose NTK as an efficient `surrogate model' for GCNs that does not suffer from performance fluctuations due to hyper-parameter tuning since it is a hyper-parameter free deterministic kernel. The efficacy of this idea is demonstrated through a comparison of different skip connections for GCNs using the surrogate NTKs.
    A unified recipe for deriving (time-uniform) PAC-Bayes bounds. (arXiv:2302.03421v4 [stat.ML] UPDATED)
    We present a unified framework for deriving PAC-Bayesian generalization bounds. Unlike most previous literature on this topic, our bounds are anytime-valid (i.e., time-uniform), meaning that they hold at all stopping times, not only for a fixed sample size. Our approach combines four tools in the following order: (a) nonnegative supermartingales or reverse submartingales, (b) the method of mixtures, (c) the Donsker-Varadhan formula (or other convex duality principles), and (d) Ville's inequality. Our main result is a PAC-Bayes theorem which holds for a wide class of discrete stochastic processes. We show how this result implies time-uniform versions of well-known classical PAC-Bayes bounds, such as those of Seeger, McAllester, Maurer, and Catoni, in addition to many recent bounds. We also present several novel bounds. Our framework also enables us to relax traditional assumptions; in particular, we consider nonstationary loss functions and non-i.i.d. data. In sum, we unify the derivation of past bounds and ease the search for future bounds: one may simply check if our supermartingale or submartingale conditions are met and, if so, be guaranteed a (time-uniform) PAC-Bayes bound.
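    The four-step recipe listed in the abstract can be sketched for the supermartingale case as follows. This is a reading of the abstract using standard textbook forms of each tool, not the paper's exact theorem; the notation ($P$, $Q$, $M_t(\theta)$, $\bar M_t$, $\delta$) is assumed here for illustration, and the prior $P$ is assumed to be fixed before seeing any data.

```latex
% Assumed notation: prior P and posterior Q over parameters \theta;
% for each \theta, (M_t(\theta))_{t \ge 0} is a nonnegative supermartingale with M_0(\theta) = 1.
\begin{align*}
\text{(a)+(b) mixture:}\quad
  & \bar M_t := \mathbb{E}_{\theta \sim P}\big[M_t(\theta)\big]
    \ \text{is again a nonnegative supermartingale with } \bar M_0 = 1, \\
\text{(c) Donsker--Varadhan:}\quad
  & \mathbb{E}_{\theta \sim Q}\big[\log M_t(\theta)\big] - \mathrm{KL}(Q \,\|\, P)
    \;\le\; \log \bar M_t
    \quad \text{for every } Q, \\
\text{(d) Ville:}\quad
  & \Pr\big(\exists\, t \ge 0 : \bar M_t \ge 1/\delta\big) \;\le\; \delta .
\end{align*}
% Combining (c) and (d): with probability at least 1 - \delta,
% simultaneously for all t and all posteriors Q,
\[
  \mathbb{E}_{\theta \sim Q}\big[\log M_t(\theta)\big]
  \;\le\; \mathrm{KL}(Q \,\|\, P) + \log(1/\delta).
\]
```

    Because Ville's inequality controls the whole trajectory of $\bar M_t$ at once, the resulting bound holds at all times simultaneously, which is where the time-uniform property comes from.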
    Strong statistical parity through fair synthetic data. (arXiv:2311.03000v1 [cs.LG])
    AI-generated synthetic data, in addition to protecting the privacy of original data sets, allows users and data consumers to tailor data to their needs. This paper explores the creation of synthetic data that embodies Fairness by Design, focusing on the statistical parity fairness definition. By equalizing the learned target probability distributions of the synthetic data generator across sensitive attributes, a downstream model trained on such synthetic data provides fair predictions across all thresholds, that is, strong fair predictions even when inferring from biased, original data. This fairness adjustment can be either directly integrated into the sampling process of a synthetic generator or added as a post-processing step. The flexibility allows data consumers to create fair synthetic data and fine-tune the trade-off between accuracy and fairness without any previous assumptions on the data or re-training the synthetic data generator.
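    To make the "across all thresholds" requirement concrete, here is a minimal sketch of how the statistical parity gap between two groups could be measured at one threshold and over every threshold. The function names and the score lists in the usage note are hypothetical illustrations, not taken from the paper.

```python
def positive_rate(scores, threshold):
    """Fraction of a group predicted positive when scores >= threshold count as positive."""
    return sum(s >= threshold for s in scores) / len(scores)


def parity_gap(scores_a, scores_b, threshold):
    """Absolute difference in positive rates between two groups at a single threshold."""
    return abs(positive_rate(scores_a, threshold) - positive_rate(scores_b, threshold))


def strong_parity_gap(scores_a, scores_b):
    """Largest parity gap over all thresholds.

    Positive rates are step functions that only change at observed scores,
    so checking the observed scores is enough to cover every threshold.
    """
    thresholds = sorted(set(scores_a) | set(scores_b))
    return max(parity_gap(scores_a, scores_b, t) for t in thresholds)
```

    For example, with made-up scores `[0.2, 0.4, 0.6, 0.8]` for one group and `[0.1, 0.3, 0.7, 0.9]` for the other, the gap at threshold 0.5 is zero, yet `strong_parity_gap` still reports 0.25, so parity at one threshold does not imply parity at all of them.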
    A Theory for Emergence of Complex Skills in Language Models. (arXiv:2307.15936v2 [cs.LG] UPDATED)
    A major driver of AI products today is the fact that new skills emerge in language models when their parameter set and training corpora are scaled up. This phenomenon is poorly understood, and a mechanistic explanation via mathematical analysis of gradient-based training seems difficult. The current paper takes a different approach, analysing emergence using the famous (and empirical) Scaling Laws of LLMs and a simple statistical framework. Contributions include: (a) A statistical framework that relates cross-entropy loss of LLMs to competence on the basic skills that underlie language tasks. (b) Mathematical analysis showing that the Scaling Laws imply a strong form of inductive bias that allows the pre-trained model to learn very efficiently. We informally call this *slingshot generalization* since naively viewed it appears to give competence levels at skills that violate usual generalization theory. (c) A key example of slingshot generalization, that competence at executing tasks involving $k$-tuples of skills emerges essentially at the same scaling and same rate as competence on the elementary skills themselves.
    Adaptive Linear Estimating Equations. (arXiv:2307.07320v2 [math.ST] UPDATED)
    Sequential data collection has emerged as a widely adopted technique for enhancing the efficiency of data gathering processes. Despite its advantages, such data collection mechanisms often introduce complexities to the statistical inference procedure. For instance, the ordinary least squares (OLS) estimator in an adaptive linear regression model can exhibit non-normal asymptotic behavior, posing challenges for accurate inference and interpretation. In this paper, we propose a general method for constructing a debiased estimator which remedies this issue. It makes use of the idea of adaptive linear estimating equations, and we establish theoretical guarantees of asymptotic normality, supplemented by discussions on achieving near-optimal asymptotic variance. A salient feature of our estimator is that, in the context of multi-armed bandits, it retains the non-asymptotic performance of the least squares estimator while also being asymptotically normal. Consequently, this work helps connect two fruitful paradigms of adaptive inference: a) non-asymptotic inference using concentration inequalities and b) asymptotic inference via asymptotic normality.
    One-Shot Strategic Classification Under Unknown Costs. (arXiv:2311.02761v1 [cs.LG])
    A primary goal in strategic classification is to learn decision rules which are robust to strategic input manipulation. Earlier works assume that strategic responses are known; while some recent works address the important challenge of unknown responses, they exclusively study sequential settings which allow multiple model deployments over time. But there are many domains (particularly in public policy, a common motivating use-case) where multiple deployments are unrealistic, or where even a single bad round is undesirable. To address this gap, we initiate the study of strategic classification under unknown responses in the one-shot setting, which requires committing to a single classifier once. Focusing on the users' cost function as the source of uncertainty, we begin by proving that for a broad class of costs, even a small mis-estimation of the true cost can entail arbitrarily low accuracy in the worst case. In light of this, we frame the one-shot task as a minimax problem, with the goal of identifying the classifier with the smallest worst-case risk over an uncertainty set of possible costs. Our main contribution is efficient algorithms for both the full-batch and stochastic settings, which we prove converge (offline) to the minimax optimal solution at the dimension-independent rate of $\tilde{\mathcal{O}}(T^{-\frac{1}{2}})$. Our analysis reveals important structure stemming from the strategic nature of user responses, particularly the importance of dual norm regularization with respect to the cost function.  ( 2 min )
    Heteroskedastic Tensor Clustering. (arXiv:2311.02306v1 [math.ST])
    Tensor clustering, which seeks to extract underlying cluster structures from noisy tensor observations, has gained increasing attention. One extensively studied model for tensor clustering is the tensor block model, which postulates the existence of clustering structures along each mode and has found broad applications in areas like multi-tissue gene expression analysis and multilayer network analysis. However, currently available computationally feasible methods for tensor clustering either are limited to handling i.i.d. sub-Gaussian noise or suffer from suboptimal statistical performance, which restrains their utility in applications that have to deal with heteroskedastic data and/or low signal-to-noise-ratio (SNR). To overcome these challenges, we propose a two-stage method, named $\mathsf{High\text{-}order~HeteroClustering}$ ($\mathsf{HHC}$), which starts by performing tensor subspace estimation via a novel spectral algorithm called $\mathsf{Thresholded~Deflated\text{-}HeteroPCA}$, followed by approximate $k$-means to obtain cluster nodes. Encouragingly, our algorithm provably achieves exact clustering as long as the SNR exceeds the computational limit (ignoring logarithmic factors); here, the SNR refers to the ratio of the pairwise disparity between nodes to the noise level, and the computational limit indicates the lowest SNR that enables exact clustering with polynomial runtime. Comprehensive simulation and real-data experiments suggest that our algorithm outperforms existing algorithms across various settings, delivering more reliable clustering performance.  ( 2 min )
    Validity problems in clinical machine learning by indirect data labeling using consensus definitions. (arXiv:2311.03037v1 [cs.LG])
    We demonstrate a validity problem of machine learning in the vital application area of disease diagnosis in medicine. It arises when target labels in training data are determined by an indirect measurement, and the fundamental measurements needed to determine this indirect measurement are included in the input data representation. Machine learning models trained on this data will learn nothing else but to exactly reconstruct the known target definition. Such models show perfect performance on similarly constructed test data but will fail catastrophically on real-world examples where the defining fundamental measurements are not or only incompletely available. We present a general procedure allowing identification of problematic datasets and black-box machine learning models trained on them, and exemplify our detection procedure on the task of early prediction of sepsis.  ( 2 min )
    Weight-Sharing Regularization. (arXiv:2311.03096v1 [cs.LG])
    Weight-sharing is ubiquitous in deep learning. Motivated by this, we introduce "weight-sharing regularization" for neural networks, defined as $R(w) = \frac{1}{d - 1}\sum_{i > j}^d |w_i - w_j|$. We study the proximal mapping of $R$ and provide an intuitive interpretation of it in terms of a physical system of interacting particles. Using this interpretation, we design a novel parallel algorithm for $\operatorname{prox}_R$ which provides an exponential speedup over previous algorithms, with a depth of $O(\log^3 d)$. Our algorithm makes it feasible to train weight-sharing regularized deep neural networks with proximal gradient descent. Experiments reveal that weight-sharing regularization enables fully-connected networks to learn convolution-like filters.  ( 2 min )
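    As a small illustration of the penalty $R(w)$ defined in the abstract, the sketch below evaluates it two ways: directly from the pairwise definition, and via a standard sorting identity for sums of pairwise absolute differences. The function names are hypothetical, and this only computes the value of $R$; it is not the paper's parallel algorithm for $\operatorname{prox}_R$.

```python
def weight_sharing_penalty(w):
    """R(w) = (1 / (d - 1)) * sum over pairs i > j of |w_i - w_j|, in O(d^2) time."""
    d = len(w)
    if d < 2:
        raise ValueError("need at least two weights")
    total = sum(abs(w[i] - w[j]) for i in range(d) for j in range(i))
    return total / (d - 1)


def weight_sharing_penalty_sorted(w):
    """Same value in O(d log d) time.

    After sorting in ascending order, the k-th weight (0-based) is the larger
    element in k pairs and the smaller element in d - 1 - k pairs, so it enters
    the pairwise sum with coefficient 2k - d + 1.
    """
    d = len(w)
    if d < 2:
        raise ValueError("need at least two weights")
    s = sorted(w)
    total = sum((2 * k - d + 1) * s[k] for k in range(d))
    return total / (d - 1)
```

    The penalty is zero exactly when all weights are equal, for instance `weight_sharing_penalty([0.5, 0.5, 0.5])`, which is why minimizing it pushes weights toward shared values.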
    Practical considerations for variable screening in the Super Learner. (arXiv:2311.03313v1 [stat.ML])
    Estimating a prediction function is a fundamental component of many data analyses. The Super Learner ensemble, a particular implementation of stacking, has desirable theoretical properties and has been used successfully in many applications. Dimension reduction can be accomplished by using variable screening algorithms, including the lasso, within the ensemble prior to fitting other prediction algorithms. However, the performance of a Super Learner using the lasso for dimension reduction has not been fully explored in cases where the lasso is known to perform poorly. We provide empirical results that suggest that a diverse set of candidate screening algorithms should be used to protect against poor performance of any one screen, similar to the guidance for choosing a library of prediction algorithms for the Super Learner.  ( 2 min )
    Variational Weighting for Kernel Density Ratios. (arXiv:2311.03001v1 [cs.LG])
    Kernel density estimation (KDE) is integral to a range of generative and discriminative tasks in machine learning. Drawing upon tools from the multidimensional calculus of variations, we derive an optimal weight function that reduces bias in standard kernel density estimates for density ratios, leading to improved estimates of prediction posteriors and information-theoretic measures. In the process, we shed light on some fundamental aspects of density estimation, particularly from the perspective of algorithms that employ KDEs as their main building blocks.  ( 2 min )
    ELEGANT: Certified Defense on the Fairness of Graph Neural Networks. (arXiv:2311.02757v1 [cs.LG])
    Graph Neural Networks (GNNs) have emerged as a prominent graph learning model in various graph-based tasks over the years. Nevertheless, due to the vulnerabilities of GNNs, it has been empirically shown that malicious attackers could easily corrupt the fairness level of their predictions by adding perturbations to the input graph data. In this paper, we take crucial steps to study a novel problem of certifiable defense on the fairness level of GNNs. Specifically, we propose a principled framework named ELEGANT and present a detailed theoretical certification analysis for the fairness of GNNs. ELEGANT takes any GNN as its backbone, and the fairness level of such a backbone provably cannot be corrupted under certain perturbation budgets for attackers. Notably, ELEGANT makes no assumptions about the GNN structure or parameters, and does not require re-training the GNNs to realize certification. Hence it can serve as a plug-and-play framework for any optimized GNNs ready to be deployed. We verify the effectiveness of ELEGANT in practice through extensive experiments on real-world datasets across different GNN backbones, where ELEGANT is also demonstrated to be beneficial for GNN debiasing. Open-source code can be found at https://github.com/yushundong/ELEGANT.  ( 2 min )
    Regularized Linear Regression for Binary Classification. (arXiv:2311.02270v1 [cs.LG])
    Regularized linear regression is a promising approach for binary classification problems in which the training set has noisy labels since the regularization term can help to avoid interpolating the mislabeled data points. In this paper we provide a systematic study of the effects of the regularization strength on the performance of linear classifiers that are trained to solve binary classification problems by minimizing a regularized least-squares objective. We consider the over-parametrized regime and assume that the classes are generated from a Gaussian Mixture Model (GMM) where a fraction $c<\frac{1}{2}$ of the training data is mislabeled. Under these assumptions, we rigorously analyze the classification errors resulting from the application of ridge, $\ell_1$, and $\ell_\infty$ regression. In particular, we demonstrate that ridge regression invariably improves the classification error. We prove that $\ell_1$ regularization induces sparsity and observe that in many cases one can sparsify the solution by up to two orders of magnitude without any considerable loss of performance, even though the GMM has no underlying sparsity structure. For $\ell_\infty$ regularization we show that, for large enough regularization strength, the optimal weights concentrate around two values of opposite sign. We observe that in many cases the corresponding "compression" of each weight to a single bit leads to very little loss in performance. These latter observations can have significant practical ramifications.  ( 2 min )
    Riemannian Laplace Approximation with the Fisher Metric. (arXiv:2311.02766v1 [cs.LG])
    Laplace's method approximates a target density with a Gaussian distribution at its mode. It is computationally efficient and asymptotically exact for Bayesian inference due to the Bernstein-von Mises theorem, but for complex targets and finite-data posteriors it is often too crude an approximation. A recent generalization of the Laplace approximation transforms the Gaussian approximation according to a chosen Riemannian geometry, providing a richer approximation family while still retaining computational efficiency. However, as shown here, its properties depend heavily on the chosen metric; indeed, the metric adopted in previous work results in approximations that are overly narrow and biased even in the limit of infinite data. We correct this shortcoming by developing the approximation family further, deriving two alternative variants that are exact in the limit of infinite data, extending the theoretical analysis of the method, and demonstrating practical improvements in a range of experiments.  ( 2 min )
    From Coupled Oscillators to Graph Neural Networks: Reducing Over-smoothing via a Kuramoto Model-based Approach. (arXiv:2311.03260v1 [cs.LG])
    We propose the Kuramoto Graph Neural Network (KuramotoGNN), a novel class of continuous-depth graph neural networks (GNNs) that employs the Kuramoto model to mitigate the over-smoothing phenomenon, in which node features in GNNs become indistinguishable as the number of layers increases. The Kuramoto model captures the synchronization behavior of non-linear coupled oscillators. Viewing GNNs as coupled oscillators, we first show the connection between the Kuramoto model and a basic GNN, and then show that the over-smoothing phenomenon in GNNs can be interpreted as phase synchronization in the Kuramoto model. The KuramotoGNN replaces this phase synchronization with frequency synchronization to prevent the node features from converging into each other while allowing the system to reach a stable synchronized state. We experimentally verify the advantages of the KuramotoGNN over the baseline GNNs and existing methods in reducing over-smoothing on various graph deep learning benchmark tasks.  ( 2 min )
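    For readers unfamiliar with the Kuramoto model, here is a minimal sketch of the classical all-to-all version, $\dot\theta_i = \omega_i + \frac{K}{N}\sum_j \sin(\theta_j - \theta_i)$. It is not the KuramotoGNN architecture; the function names, the explicit Euler integrator, the step size, and the coupling strength are all illustrative assumptions. With identical natural frequencies the phases collapse to a common value, which is the phase synchronization the abstract relates to over-smoothing.

```python
import cmath
import math


def kuramoto_step(theta, omega, coupling, dt):
    """One explicit Euler step of the all-to-all Kuramoto model."""
    n = len(theta)
    new_theta = []
    for i in range(n):
        # Each oscillator is pulled toward the phases of all the others.
        interaction = sum(math.sin(theta[j] - theta[i]) for j in range(n))
        new_theta.append(theta[i] + dt * (omega[i] + coupling / n * interaction))
    return new_theta


def order_parameter(theta):
    """Magnitude of the mean phasor: 1 means identical phases, near 0 means spread out."""
    return abs(sum(cmath.exp(1j * t) for t in theta) / len(theta))


def simulate(theta0, omega, coupling=1.0, dt=0.05, steps=2000):
    """Integrate the model from initial phases theta0 and return the final phases."""
    theta = list(theta0)
    for _ in range(steps):
        theta = kuramoto_step(theta, omega, coupling, dt)
    return theta
```

    Starting from phases such as `[0.0, 0.5, 1.0, 1.5]` with all natural frequencies set to zero, `order_parameter` rises toward 1 as the phases merge, mirroring node features becoming indistinguishable.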
    Spatial Process Approximations: Assessing Their Necessity. (arXiv:2311.03201v1 [stat.ML])
    In spatial statistics and machine learning, the kernel matrix plays a pivotal role in prediction, classification, and maximum likelihood estimation. A thorough examination reveals that for large sample sizes, the kernel matrix becomes ill-conditioned, provided the sampling locations are fairly evenly distributed. This condition poses significant challenges to numerical algorithms used in prediction and estimation computations and necessitates an approximation to prediction and the Gaussian likelihood. A review of current methodologies for managing large spatial data indicates that some fail to address this ill-conditioning problem. Such ill-conditioning often results in low-rank approximations of the stochastic processes. This paper introduces various optimality criteria and provides solutions for each.  ( 2 min )
    On Subagging Boosted Probit Model Trees. (arXiv:2311.02827v1 [stat.ML])
    With the insight of the bias-variance decomposition, we design a new hybrid bagging-boosting algorithm named SBPMT for classification problems. For the boosting part of SBPMT, we propose a new tree model called the Probit Model Tree (PMT) as the base classifier in the AdaBoost procedure. For the bagging part, instead of subsampling from the dataset at each step of boosting, we perform boosted PMTs on each subagged dataset and combine them into a powerful "committee", which can be viewed as an incomplete U-statistic. Our theoretical analysis shows that (1) SBPMT is consistent under certain assumptions, (2) increasing the number of subagging rounds can reduce the generalization error of SBPMT to some extent, and (3) a large number of ProbitBoost iterations in PMT can benefit the performance of SBPMT with fewer steps in the AdaBoost part. These three properties are verified by a famous simulation designed by Mease and Wyner (2008). The last two points also provide useful guidance for model tuning. A comparison of performance with other state-of-the-art classification methods illustrates that the proposed SBPMT algorithm has competitive prediction power in general and performs significantly better in some cases.  ( 2 min )
    Nonparametric modeling of the composite effect of multiple nutrients on blood glucose dynamics. (arXiv:2311.03129v1 [stat.ML])
    In biomedical applications it is often necessary to estimate a physiological response to a treatment consisting of multiple components, and learn the separate effects of the components in addition to the joint effect. Here, we extend existing probabilistic nonparametric approaches to explicitly address this problem. We also develop a new convolution-based model for composite treatment-response curves that is more biologically interpretable. We validate our models by estimating the impact of carbohydrate and fat in meals on blood glucose. By differentiating treatment components, incorporating their dosages, and sharing statistical information across patients via a hierarchical multi-output Gaussian process, our method improves prediction accuracy over existing approaches, and allows us to interpret the different effects of carbohydrates and fat on the overall glucose response.  ( 2 min )
    Bayesian Optimization of Function Networks with Partial Evaluations. (arXiv:2311.02146v1 [stat.ML])
    Bayesian optimization is a framework for optimizing functions that are costly or time-consuming to evaluate. Recent work has considered Bayesian optimization of function networks (BOFN), where the objective function is computed via a network of functions, each taking as input the output of previous nodes in the network and additional parameters. Exploiting this network structure has been shown to yield significant performance improvements. Existing BOFN algorithms for general-purpose networks are required to evaluate the full network at each iteration. However, many real-world applications allow evaluating nodes individually. To take advantage of this opportunity, we propose a novel knowledge gradient acquisition function for BOFN that chooses which node to evaluate as well as the inputs for that node in a cost-aware fashion. This approach can dramatically reduce query costs by allowing the evaluation of part of the network at a lower cost relative to evaluating the entire network. We provide an efficient approach to optimizing our acquisition function and show it outperforms existing BOFN methods and other benchmarks across several synthetic and real-world problems. Our acquisition function is the first to enable cost-aware optimization of a broad class of function networks.  ( 2 min )
    Estimating treatment effects from single-arm trials via latent-variable modeling. (arXiv:2311.03002v1 [cs.LG])
    Randomized controlled trials (RCTs) are the accepted standard for treatment effect estimation but they can be infeasible due to ethical reasons and prohibitive costs. Single-arm trials, where all patients belong to the treatment group, can be a viable alternative but require access to an external control group. We propose an identifiable deep latent-variable model for this scenario that can also account for missing covariate observations by modeling their structured missingness patterns. Our method uses amortized variational inference to learn both group-specific and identifiable shared latent representations, which can subsequently be used for (i) patient matching if treatment outcomes are not available for the treatment group, or for (ii) direct treatment effect estimation assuming outcomes are available for both groups. We evaluate the model on a public benchmark as well as on a data set consisting of a published RCT study and real-world electronic health records. Compared to previous methods, our results show improved performance both for direct treatment effect estimation as well as for effect estimation via patient matching.  ( 2 min )
    Log-Concavity of Multinomial Likelihood Functions Under Interval Censoring Constraints on Frequencies or Their Partial Sums. (arXiv:2311.02763v1 [math.ST])
    We show that the likelihood function for a multinomial vector observed under arbitrary interval censoring constraints on the frequencies or their partial sums is completely log-concave by proving that the constrained sample spaces comprise M-convex subsets of the discrete simplex.  ( 2 min )
    Exploiting Correlated Auxiliary Feedback in Parameterized Bandits. (arXiv:2311.02715v1 [cs.LG])
    We study a novel variant of the parameterized bandits problem in which the learner can observe additional auxiliary feedback that is correlated with the observed reward. The auxiliary feedback is readily available in many real-life applications, e.g., an online platform that wants to recommend the best-rated services to its users can observe the user's rating of service (rewards) and collect additional information like service delivery time (auxiliary feedback). In this paper, we first develop a method that exploits auxiliary feedback to build a reward estimator with tight confidence bounds, leading to a smaller regret. We then characterize the regret reduction in terms of the correlation coefficient between reward and its auxiliary feedback. Experimental results in different settings also verify the performance gain achieved by our proposed method.  ( 2 min )
    p-Laplacian Transformer. (arXiv:2311.03235v1 [cs.LG])
    $p$-Laplacian regularization, rooted in graph and image signal processing, introduces a parameter $p$ to control the regularization effect on these data. Smaller values of $p$ promote sparsity and interpretability, while larger values encourage smoother solutions. In this paper, we first show that the self-attention mechanism obtains the minimal Laplacian regularization ($p=2$) and encourages the smoothness in the architecture. However, the smoothness is not suitable for the heterophilic structure of self-attention in transformers where attention weights between tokens that are in close proximity and non-close ones are assigned indistinguishably. From that insight, we then propose a novel class of transformers, namely the $p$-Laplacian Transformer (p-LaT), which leverages $p$-Laplacian regularization framework to harness the heterophilic features within self-attention layers. In particular, low $p$ values will effectively assign higher attention weights to tokens that are in close proximity to the current token being processed. We empirically demonstrate the advantages of p-LaT over the baseline transformers on a wide range of benchmark datasets.  ( 2 min )
    Structured Neural Networks for Density Estimation and Causal Inference. (arXiv:2311.02221v1 [cs.LG])
    Injecting structure into neural networks enables learning functions that satisfy invariances with respect to subsets of inputs. For instance, when learning generative models using neural networks, it is advantageous to encode the conditional independence structure of observed variables, often in the form of Bayesian networks. We propose the Structured Neural Network (StrNN), which injects structure through masking pathways in a neural network. The masks are designed via a novel relationship we explore between neural network architectures and binary matrix factorization, to ensure that the desired independencies are respected. We devise and study practical algorithms for this otherwise NP-hard design problem based on novel objectives that control the model architecture. We demonstrate the utility of StrNN in three applications: (1) binary and Gaussian density estimation with StrNN, (2) real-valued density estimation with Structured Autoregressive Flows (StrAFs) and Structured Continuous Normalizing Flows (StrCNF), and (3) interventional and counterfactual analysis with StrAFs for causal inference. Our work opens up new avenues for learning neural networks that enable data-efficient generative modeling and the use of normalizing flows for causal effect estimation.  ( 2 min )
    Improved Convergence Rates of Anderson Acceleration for a Large Class of Fixed-Point Iterations. (arXiv:2311.02490v1 [math.NA])
    This paper studies Anderson acceleration (AA) for fixed-point methods ${x}^{(k+1)}=q({x}^{(k)})$. It provides the first proof that when the operator $q$ is linear and symmetric, AA improves the root-linear convergence factor over the fixed-point iterations. When $q$ is nonlinear, yet has a symmetric Jacobian at the solution, a slightly modified AA algorithm is proved to have an analogous root-linear convergence factor improvement over fixed-point iterations. Simulations verify our observations. Furthermore, experiments with different data models demonstrate AA is significantly superior to the standard fixed-point methods for Tyler's M-estimation.  ( 2 min )
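    As a rough illustration of the setting in the abstract, the sketch below compares plain fixed-point iteration with an undamped, depth-one Anderson acceleration on a small linear, symmetric operator. The depth, the stopping rule, the function names, and the demo operator `demo_q` are assumptions chosen for illustration; the paper's analysis and its modified algorithm for the nonlinear case are not reproduced here.

```python
import math


def norm(v):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(c * c for c in v))


def demo_q(x):
    """Hypothetical linear operator q(x) = A x + b with symmetric A (eigenvalues 0.9 and 0.3)."""
    return [0.6 * x[0] + 0.3 * x[1] + 1.0, 0.3 * x[0] + 0.6 * x[1] + 2.0]


def fixed_point(q, x0, tol=1e-8, max_iter=1000):
    """Plain iteration x <- q(x); returns the iterate and the number of steps taken."""
    x = list(x0)
    for k in range(max_iter):
        qx = q(x)
        if norm([a - b for a, b in zip(qx, x)]) < tol:
            return x, k
        x = qx
    return x, max_iter


def anderson_depth_one(q, x0, tol=1e-8, max_iter=1000):
    """Anderson acceleration that mixes the two most recent values of q."""
    x = list(x0)
    prev_qx = prev_r = None
    for k in range(max_iter):
        qx = q(x)
        r = [a - b for a, b in zip(qx, x)]  # residual q(x) - x
        if norm(r) < tol:
            return x, k
        if prev_r is None:
            x_next = qx  # no history yet: take a plain step
        else:
            dr = [a - b for a, b in zip(r, prev_r)]
            denom = sum(c * c for c in dr)
            # gamma minimizes || r - gamma * (r - prev_r) || in the least-squares sense.
            gamma = sum(a * b for a, b in zip(r, dr)) / denom if denom > 0 else 0.0
            x_next = [a - gamma * (a - b) for a, b in zip(qx, prev_qx)]
        prev_qx, prev_r = qx, r
        x = x_next
    return x, max_iter
```

    Calling both `fixed_point(demo_q, [0.0, 0.0])` and `anderson_depth_one(demo_q, [0.0, 0.0])` returns the same fixed point, and the second element of each result gives the iteration count for a side-by-side comparison.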

  • Open

    [R] Are there any research papers which show why Wasserstein distance is better than Jensen-Shannon/KL divergence/Bhattacharyya distance for specific use cases?
    I am trying to find reliable research that shows why displacement-based metrics such as the Wasserstein distance are better suited than the Jensen-Shannon distance in specific use cases and for certain sets of distributions. Are there any well-known works that delve into this? submitted by /u/V1bicycle [link] [comments]  ( 9 min )
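    One setting where the two families of metrics behave differently is distributions with disjoint supports. The sketch below is only an illustration, not taken from any particular paper: it compares the Jensen-Shannon divergence (the square of the Jensen-Shannon distance) with the 1D Wasserstein-1 distance for two point masses on a unit-spaced grid. The helper names and the grid size are assumptions made for the example.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two discrete distributions."""
    def kl(a, b):
        # Terms with a_i == 0 contribute nothing to the KL sum.
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def wasserstein_1d(p, q):
    """W1 distance between two histograms on the same unit-spaced grid.

    In one dimension this equals the summed absolute gap between the two CDFs.
    """
    total, cdf_gap = 0.0, 0.0
    for pi, qi in zip(p, q):
        cdf_gap += pi - qi
        total += abs(cdf_gap)
    return total

def point_mass(position, size=10):
    """Distribution that puts all its mass on one grid cell (illustrative helper)."""
    return [1.0 if i == position else 0.0 for i in range(size)]

base = point_mass(0)
for shift in (1, 4, 8):
    moved = point_mass(shift)
    # JS stays at log(2) for every shift; W1 grows with the shift.
    print(shift, round(js_divergence(base, moved), 4), wasserstein_1d(base, moved))
```

    As soon as the supports stop overlapping, the Jensen-Shannon value saturates at log 2 and no longer reflects how far apart the distributions are, while the Wasserstein distance keeps tracking the displacement.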
    [R] Animating NeRFs from Texture Space: A Framework for Pose-Dependent Rendering of Human Performances
    The key challenge is that NeRFs typically require multiple view images to reconstruct a scene in 3D, whereas videos provide only a single view over time. But that means we have to capture a lot of data to create a NeRF. What if there was a way to create 3D animated models of humans from monocular video footage using NeRFs? A new paper addresses this with a novel approach. First, they fit a parametric model (SMPL) to align with the subject in each frame of the video. This provides an initial estimate of the 3D shape. Second, they transform the coordinate system of the NeRF based on the surface of the SMPL model. This involves projecting input points onto the model's surface and calculating distances to the surface. Third, they incorporate the SMPL model's joint rotations to animate it in a variety of poses based on the video. This adds important pose-dependent shape cues. Finally, they use a neural network module to further refine the coordinate transform, correcting any inaccuracies in the SMPL fit to ensure spatial alignments are accurate for rendering. In experiments, they demonstrate their method generates high-quality renderings of subjects in novel views and poses not seen in the original video footage. The results capture nuanced clothing and hair deformations in a pose-dependent way. There are some example photos in the article that really show this off. Limitations exist for handling extremely complex motions and generating detailed face/hand geometry from low-resolution videos. But overall, the technique significantly advances the state-of-the-art in reconstructing animatable human models from monocular video. TLDR: They found a new NeRF technique to turn videos into controllable 3D models Full paper summary here. Paper is here. submitted by /u/Successful-Western27 [link] [comments]  ( 9 min )
    [P] Want to create an AI to play Super Mario for the DS
    As the title says, I want to make an AI that will play New Super Mario Bros. for the DS. I would like to be pointed in the right direction toward any resources that can help me complete this project. I know a fair bit about neural networks, but I am stumped on how to achieve this. I am using Python and PyTorch to make this happen. Thank you for your responses. submitted by /u/antonio_porven [link] [comments]  ( 9 min )
    [P] Model training bottlenecked by CPU.
    Recently, I've been working on some projects for fun, trying out some things I hadn't worked with before, such as profiling. But after profiling my code, I found out that my average GPU activity is around 50%. Apparently, the code frequently hangs for a few hundred milliseconds on the dataloader process. I've tried a few things in the dataloader: increasing/decreasing the number of workers, setting pin-memory to true or false, but neither seems to really matter. I have an NVME drive, so the disk is not the problem either. I've concluded that the bottleneck must be the CPU. Now, I've read that pre-processing the data might help, so that the dataloader doesn't have to decode the images, for example, but I don't really know how to go about this. I have around 2TB of NVME storage, and I've got a couple datasets on the disk (ImageNet and INaturalist are the two biggest ones), so I don't suppose I'll be able to store them on the disk uncompressed. Is there anything I can do to lighten the load on the CPU during training so that I can take advantage of the 50% of the GPU that I'm not using at the moment? submitted by /u/AdSignificant9235 [link] [comments]  ( 9 min )
    [Project] I made a learning environment/game using unity's mlagents and used DeepRL to train an AI agent to autonomously race a car around. (demo available)
    ​ DEMO AVAILABLE submitted by /u/Sookeyy [link] [comments]  ( 9 min )
    Is it possible and feasible to identify roofing damage, age of roof, and roof type using satellite imagery and a CNN? [R]
    Working on a little side project. How would you do it? How would you start collecting data for this? I’m assuming the hardest part is going to be getting the data. submitted by /u/Longjumping-Name7564 [link] [comments]  ( 9 min )
    Is anyone familiar with methods of evaluating if a model was trained on some data? [R]
    I have a model and some articles. I'm trying to figure out if the model was trained on these articles before. Is anyone familiar with existing methods on how to do this? I know Microsoft had one method here (Appendix B) where they did the following: "for each content datapoint in a dataset, we partition the content in half. We pass in the first half as context and ask the model to generate up to the length of the second half. We then measure how similar the generated content is to the held-out second half using Levenshtein distance ratio, which is defined as one minus the ratio of Levenshtein distance to the maximum possible distance." Wondering if anyone is familiar with other methods submitted by /u/DaBobcat [link] [comments]  ( 9 min )
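    The quoted procedure can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the implementation from the cited appendix: it compares at the character level, it takes "maximum possible distance" to mean the length of the longer string, and `generate(prompt, max_chars)` is a hypothetical stand-in for whatever call produces a completion from the model under test.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute all cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute (free on a match)
        prev = curr
    return prev[-1]

def levenshtein_ratio(generated, reference):
    """One minus edit distance divided by the largest possible distance.

    Assumption: the largest possible distance is the length of the longer string.
    """
    longest = max(len(generated), len(reference))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(generated, reference) / longest

def memorization_score(text, generate):
    """Prompt with the first half of `text` and compare against the held-out half.

    `generate` is a hypothetical callable: generate(prompt, max_chars) -> str.
    """
    mid = len(text) // 2
    context, held_out = text[:mid], text[mid:]
    completion = generate(context, len(held_out))
    return levenshtein_ratio(completion, held_out)
```

    A score near 1 means the completion reproduces the held-out half almost verbatim. The score is only suggestive on its own, so it is usually read against scores for articles the model could not have seen.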
    [D] A Comprehensive Hand-Curated Resource List for Best OpenAI-GPTs
    Greetings, Excited to share with all those interested in GPTs released by OpenAI at Devday. We are a group of researchers who have carefully curated a comprehensive list on Github of the best GPT models, including descriptions, URLs, and other details. While our initial focus is on GPT models available through OpenAI, we will continuously maintain and update the list as new models are released. Resource list: https://github.com/promptslab/Awesome-Openai-GPTs We hope it will help you to get started & learn more about GPTs. Thank you :) https://preview.redd.it/1pjon2m6azyb1.png?width=1678&format=png&auto=webp&s=9999c13c20f429a8201bf0f1ce51489765c9838d ​ submitted by /u/Alternative-File-146 [link] [comments]  ( 9 min )
    [R] Suggestions for research topics in Neural Network pruning?
    Sorry for the vague question but I’m looking for relevant research topics related to NN pruning. Any suggestions are appreciated, thanks! submitted by /u/Sidekiiick02 [link] [comments]  ( 9 min )
    [discussion] ChatGPT and machine learning studies
    Hello, as someone interested in AI and machine learning, is it worth learning it when we have ChatGPT and similar solutions that keep getting bigger and bigger? Or is it better to go the classic software development route to build the interfaces and apps for ML solutions? Thanks submitted by /u/Particular_Tea2307 [link] [comments]  ( 9 min )
    [D] Changing model architecture in transformers. Is it me or the library?
    I've got the following issue: Sometimes I want to adjust the architecture of a Hugging Face model, e.g. for implementing the AnyMAL paper on Zephyr. It is not that difficult, just adding another input and concatenating it with the text input. While I know how to do it in pure PyTorch, I struggle with "hacking" the Hugging Face models. I end up wrapping them in an nn.Module, which creates a mess. The Transformers Trainer is not usable in that case, so I end up writing a training loop from scratch (which lacks all the comfort). Alternatively, I mess with the custom dataset and data collator. Is there any simple trick for modifying the architecture without breaking the transformers models and losing all the benefits and custom implementation details? submitted by /u/enricopallazo1 [link] [comments]  ( 9 min )
    [D] Master's degree thesis
    Hello everyone, I'm completing my Master's Degree in Artificial Intelligence in Italy and I have to choose the thesis subject. I'd like to deepen my knowledge about LLMs since they are the thing of the moment and we didn't study them during the course. My initial idea was to try to work on hallucinations by integrating external knowledge from Ontologies/Knowledge Bases, probably implementing some form of RAG. However, in recent weeks a lot of papers were published on the subject and I don't know if it will still be relevant when I graduate (in 6 months or so). Also, I haven't validated the idea yet and don't know if it makes any sense. Do you have any suggestions on LLM-related topics that I can study for my thesis? submitted by /u/Lonely-Lingonberry-5 [link] [comments]  ( 9 min )
    [R] A Systematic Review of Deep Graph Neural Networks: Challenges, Classification, Architectures, Applications & Potential Utility in Bioinformatics
    Paper: https://arxiv.org/abs/2311.02127 Abstract: In recent years, tasks of machine learning ranging from image processing & audio/video analysis to natural language understanding have been transformed by deep learning. The data content in all these scenarios are expressed via Euclidean space. However, a considerable amount of application data is structured in non-Euclidean space and is expressed as graphs, e.g. dealing with complicated interactions & object interdependencies. Modelling physical systems, learning molecular signatures, identifying protein interactions and predicting diseases involve utilising a model that can adapt from graph data. Graph neural networks (GNNs), specified as artificial-neural models, employ message transmission between graph nodes to represent graph dependencies and are primarily used in the non-Euclidean domain. Variants of GNN like Graph Recurrent Networks (GRN), Graph Auto Encoder (GAE), Graph Convolution Networks (GCN), Graph Adversarial Methods & Graph Reinforcement learning have exhibited breakthrough productivity on a wide range of tasks, especially in the field of bioinformatics, in recent years as a result of the rapid collection of biological network data. Apart from presenting all existing GNN models, mathematical analysis and comparison of the variants of all types of GNN have been highlighted in this survey. Graph neural networks are investigated for their potential real-world applications in various fields, focusing on Bioinformatics. Furthermore, resources for evaluating graph neural network models and accessing open-source code & benchmark data sets are included. Ultimately, we provide some (seven) proposals for future research in this rapidly evolving domain. GNNs have the potential to be an excellent tool for solving a wide range of biological challenges in bioinformatics research, as they are best represented as connected complex graphs. ​ submitted by /u/APaperADay [link] [comments]  ( 9 min )
    [Discussion] Going from PoC to Production-ready AI data infra?
    Many of the best practices from ELT fully apply to the emerging AI tech stack, but there is an extra step missing - publishing transformed data to vector data stores. Airbyte suggests “ELTP”, or Extract, Load, Transform, and Publish, as an extensible architecture for delivering data to vector databases. How are folks syncing data between their sources and vector databases? submitted by /u/Chemical-Treat6596 [link] [comments]  ( 9 min )
    [D] How to learn about distributed training
    Hello, I have been using a single GPU for my school but now there is a need to do distributed training. How should I learn about it? Eventually I want to use deepspeed, huggingface, etc. submitted by /u/rodeowrong [link] [comments]  ( 9 min )
    [P] Launching Arch (Feedback appreciated!): Simplifying Multi-tenant Data Integration for ML/AI applications
    Hello r/MachineLearning and developer community! We’re the team behind Meltano, an open source project for data movement and integration. Today we’re announcing our new company name and product: Arch! We’re building Arch so you can build AI-powered features on your customer’s data without having to worry about building bespoke data infrastructure just to access the data. A clear target for us in building our initial version was to aim to build a tool that could power LLM-based applications, for multi-tenant use cases, using any data, which would be loved by software engineers. What we got so far is: A great demo + two blog posts detailing what we’re in the process of building + a waitlist if you’re interested in catching up. Of course, you can also schedule a demo. Key Highlights of Arch, from what we’ve already built: Fresh Data from Anywhere: Leverage data from any source — APIs, databases, files — seamlessly integrated into Arch. Auto Vector-embeddings: Build AI features fast, with automatic generation and synchronization of vector embeddings for any data. Effortless Multi-Tenancy: Secure and easy management of third-party authentication including OAuth, supporting numerous customers with customizable per-tenant SQL transformations. Flexible Data Access: Access any data table through SQL, GraphQL, or REST directly from your application. Developer-Centric Design: Arch is built with developers in mind, featuring a declarative approach and integrated best practices for version control. If this is of any interest to you, we’d love to chat. Also very happy to answer any questions here. Blog post: https://www.arch.dev/blog/announcing-arch-the-data-backend-for-ai-products/ GitHub: https://github.com/archdotdev submitted by /u/tayloramurphy [link] [comments]
    Panel time series forecasting [p]
    I have been stuck on this project for months. I have a panel time series: 2000 time series, each only 14 points long, and many covariates (20). I need to predict a single target variable while taking all the covariates into account. If you have ideas, links, or resources that can help me solve this problem, please let me know. Thanks a lot. submitted by /u/Beginner4ever [link] [comments]  ( 9 min )
    [D] Optimization Problem
    There's a trading system that has a few parameters (like TP (take profit), SL (stop loss), time limit and a couple of other ones). As you might know, there's a certain range for each of them. Let's say we want the TP to always be between 1% and 3%, and the same kind of bound applies to all the other parameters. I specifically want to know how I can find the most suitable set of parameters. For example, if TP is 2%, SL is 8%, time limit is 2000 minutes, ... the result would be less risky (less profit). Since the number of parameters is not low (there are 10 of them) and the range for each of them is continuous, it's practically impossible for us to run the system for every set of parameters. Therefore we tried using an optimization algorithm to find the desired result. We ended up using a genetic algorithm. It wasn't bad, but it turned out not to be the best option, since it takes a lot of time to run (the initial population should be 1000 and each single run takes 20 minutes, so...). Also consider that close parameter values do not give similar results: for example, if TP = 1.5%, SL = 8%, time limit = 2000 and I only change TP to 1.8%, the result changes completely, and you can't predict how it will affect the system. So there is a need for an optimization algorithm that can find the numerical values of these parameters. Any ideas are highly appreciated! submitted by /u/Felicity_222 [link] [comments]  ( 9 min )
    [D] How to Build Data Products | Evolve: Part 4/4 Advanced SLOs, Feedback Loops, Optimised Data Product, and more!
    In this article, the authors discuss the Evolve stage of a four-part series on how to build data products. Relevant for all the modern age data folks! What awaits you in this succinct piece: >>Introduction: ‘Evolve’ at a Glance, Relevance >>Fundamentals —— Evolutionary Architecture —— Fitness Functions —— Fitness Parameters —— Dynamic Changes —— Feedback Loops >>Evolving Data Products with Self-Serve Data Platforms —— Metrics Monitoring, SLO Optimisation, Catalysts for Higher Adoption —— Progressive Inclusion of Multiple Use Cases —— Resource Optimisation —— Maintenance Automation —— RCA & Log Analysis Enhancements —— Other Capabilities Critical to Evolve Stage >>Fitness in the Context of Data Mesh >>Read the complete article here: https://moderndata101.substack.com/p/evolving-data-products What are your thoughts on the entire approach??? submitted by /u/growth_man [link] [comments]  ( 9 min )
    [D] Is a career in Machine Learning satisfying for Linguists?
    Title says it all; I have a passion for linguistics and want to apply it outside the academe. As far as I can tell, the majority of machine learning is dealing with databases, but for anyone who has dealt with LLMs, how much of linguistics is actually applied? submitted by /u/Steak-Burrito [link] [comments]  ( 9 min )
    [Research] Looking for an incomplete dataset that should be messy or contain various data quality issues.
    Hello, Reddit community, I'm working on a project that focuses on query-oriented data cleaning with human expert involvement, and I'm in search of a suitable dataset to support this research. The dataset should ideally contain messy or incomplete data. If you know of any relevant datasets or sources where I can find such data, I would greatly appreciate your assistance. Additionally, if you have any suggestions or insights on where to look for datasets with data quality issues, please feel free to share them. Thank you in advance for your help and suggestions! submitted by /u/thelifeofZ080 [link] [comments]
    [P] Residual-free, purely-feedforward network trains to 94% on CIFAR10 in <6.3 seconds on a single A100
    Release link: https://github.com/tysam-code/hlb-CIFAR10/releases/tag/v0.7.0 Hi, I'm sure some of you have seen this project as it's progressed over the past year or so. Today we've hit a pretty big milestone by removing residual layers from the network without reducing the accuracy or convergence time under (what are pretty brutal) optimization conditions. The network trains more quickly as a result of fewer kernel launches, though I am sure the learnable information flow also has some significant impact as well. There is a technical discussion at https://twitter.com/hi_tysam/status/1721764010159477161 that goes over more details. If you'd like a TL;DR -- enforcing information flow in neural networks via external operators (like channel concatenation in DenseNets and channel addition i…  ( 10 min )
    [D] Reverse engineering GPT-vision from pricing
    So I have been looking at GPT4-V pricing trying to determine what kind of pipeline they use, feel free to chime in, dispute, etc. I do not have many conclusions but am hoping that the crowd is wiser. https://preview.redd.it/t4udi6lnntyb1.png?width=552&format=png&auto=webp&s=6f02a8c62cb9eb5b104ef36f50d2e8d0ee7a431c Observations: you are billed on the token count, which can be calculated from the image resolution. This suggests they do not append the OCR result to the GPT4 input. There are 85 base tokens irrespective of the image size. Maybe they run the whole image through some vision encoder and somehow get 85 tokens? 85 is a strange number: not close to a power of 2, no convenient squares, so what does it have? Why 85? Maybe to mess with us? The image is tiled with 512x512 tiles, and each tile converts to 170 tokens. 170 = 13*13+1? Maybe they use some kind of OCR and 170 is the average number of tokens they expect? But that would mean that GPT4 should not be able to differentiate small things in an image (it would just have 85 global tokens). Knowing OpenAI, it seems unlikely they would have the 2-stage pipeline. GPT4-V can accurately read text from an image. My guess would be a strong vision encoder for the 85 tokens and some light encoder for the 512x512 tiles, where most of the processing happens inside GPT4. Retrofitting GPT4 with vision suggests they have a vision encoder which maps to GPT4 tokens. What do you think? submitted by /u/President_Xi_ [link] [comments]
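    The billing rule described in the post (85 base tokens plus 170 tokens per 512x512 tile) can be written down as a small calculator. This is a rough sketch built only from the numbers quoted above; it assumes the image is tiled at the resolution given, so any resizing the service may apply beforehand is not modeled, and the function name is made up for the example.

```python
import math

BASE_TOKENS = 85        # flat per-image cost, as observed in the post
TOKENS_PER_TILE = 170   # cost of each 512x512 tile, as observed in the post
TILE_SIZE = 512

def estimate_image_tokens(width, height):
    """Estimate billed tokens for an image of the given pixel size.

    Assumption: tiles are counted on the image as given (no prior resizing).
    """
    tiles = math.ceil(width / TILE_SIZE) * math.ceil(height / TILE_SIZE)
    return BASE_TOKENS + TOKENS_PER_TILE * tiles

for size in [(512, 512), (1024, 1024), (513, 512)]:
    print(size, estimate_image_tokens(*size))
```

    Note that 170 is exactly 2 x 85, so under this rule every image costs an odd multiple of 85 tokens, which may or may not say something about the encoder behind it.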
    [N] GPT-4 Turbo with 128K of context
    https://openai.com/blog/new-models-and-developer-products-announced-at-devday Excited for the RAG implementations this will support. Also, turbo! submitted by /u/gar1t [link] [comments]
    [Project] What process is used in these reinforcement learning models I see on Youtube?
    I have seen videos like this: https://www.youtube.com/watch?v=Dw3BZ6O_8LY and I'm curious as to how someone would do this. Some classmates and I want to recreate a board game and train a model on a digital recreation of it, but we're not sure where to start. Any help is useful! submitted by /u/Aboudi556 [link] [comments]
  • Open

    The Biggest Update for ChatGPT
    submitted by /u/Senior_tasteey [link] [comments]
    Report finds that people are comfortable using AI for menial tasks at work but they would much rather rely on or engage with a human for the more personal and subjective aspects of their job.
    submitted by /u/Sy3Zy3Gy3 [link] [comments]
    cosmictrip.space updated with DALL-E 3: space and sci-fi images generated by prompts written by GPT-4
    submitted by /u/cryptoz [link] [comments]
    Is there a way to AI cover 60+ minutes of voice at one time?
    I'm in a pickle. I want to make a podcast. I don't want my voice to be public on it (but a friend of mine is willing to let me use them as a model), and I don't want to have to cut the audio into little clips, run them through the software that covers the audio, and restitch it back into one big file. Is there any way to overcome all those dilemmas, or am I just gonna have to deal with a few? Just figured I'd ask if anyone had any experience with AI voice podcasts, just in case. submitted by /u/spazzyvomit916 [link] [comments]
    Trump versus Human-Sized Crab
    submitted by /u/co8222 [link] [comments]
    The first AI nation? A ship with 10,000 Nvidia H100 GPUs worth $500 million could become the first ever sovereign territory that relies entirely on artificial intelligence for its future | TechRadar
    . submitted by /u/AminoOxi [link] [comments]
    From RNNs to GPT4 - 10 years of NLP research explained in 50 concepts
    submitted by /u/AvvYaa [link] [comments]
    "The Dark Reality of Artificial Intelligence: Navigating Ethical, Societal, and Existential Challenges"
    submitted by /u/Fit-Code-5141 [link] [comments]
    Need help for NSFW Ai CHATBOT
    I know this is a stupid question, but I tried services like crushonai and more, and they all require premium memberships. Is there any way to use them completely free, or are there any alternatives? I am in uni but live off my parents. Please help me! submitted by /u/RLIIDarK [link] [comments]
    Best Ai for generating repeating patterns?
    I’ve been testing different stuff, Craiyon does an excellent job but it’s not “private”. Dalle 3 is good but not great, Midjourney is meh (and or I just haven’t become too acquainted with it yet/don’t know any tricks). Do y’all have anything that you recommend?? Pattern examples— Wallpaper, wrapping papers, indigenous blankets, scrapbooking textures, Mossy Oak/Real Tree camo, interior design tiles, flooring, carpet samples submitted by /u/Maelasae [link] [comments]
    GREAT ANSWER FROM THE DUDE
    submitted by /u/the_anonymizer [link] [comments]
    AI spreadsheet app?
    So I am searching for an app which helps me edit a product sheet. It contains 500+ products, but I need to edit the titles (or, well, delete one word) and optimize the product descriptions. What would be good? submitted by /u/MannyRibera32 [link] [comments]
    Artificial Intelligence Clashes with Copyright
    Artificial intelligence (AI) is causing clashes with copyright laws as it absorbs protected creations without authorization. Artists have complained about the theft of their work for training AI to imitate creators. The EU is preparing regulations to address this issue, while the UN held its first conference on the impact of AI. The question arises: is AI robbing artists of their work? AI poses a threat to intellectual property rights, with artists experiencing theft of their work for AI training purposes. Stephen Fry and Scarlett Johansson are among the artists who have taken legal action against unauthorized use of their voice and image by AI applications. The EU is working on regulations to make AI technology fairer, including in the creative field. AI's impact on copyright raises concerns about the loss of credit and compensation for creators. AI-generated works also raise questions about plagiarism and the economic benefit for companies. Future EU regulations aim to register generative AI systems and ensure transparency in data usage. Source : https://english.elpais.com/culture/2023-11-06/artificial-intelligence-clashes-with-copyright-is-it-stealing-thousands-of-protected-creations.html submitted by /u/NuseAI [link] [comments]
    Grandma's house, my childhood
    submitted by /u/Sea_Permit5660 [link] [comments]
    Hobby project - Face Occlusion Detector
    submitted by /u/Gloomy_Recognition_4 [link] [comments]
    OpenAI unveils personalized AI apps as it seeks to expand its ChatGPT consumer business
    submitted by /u/donutloop [link] [comments]
    One-Minute Daily AI News 11/6/2023
    Microsoft-backed OpenAI announces GPT-4 Turbo, its most powerful AI yet.[1] OpenAI offers to pay for ChatGPT customers’ copyright lawsuits.[2] Artificial intelligence start-up OpenAI laid out an ambitious vision for expanding its business selling directly to consumers, unveiling Monday an app-store-like marketplace where users will get paid for making chatbots on the company’s technology.[3] At OpenAI’s first developer conference, the company behind ChatGPT announced new tools that let anyone create a customized chatbot or AI agent—no coding skills required.[4] Sources: [1] https://www.cnbc.com/2023/11/06/openai-announces-more-powerful-gpt-4-turbo-and-cuts-prices.html [2] https://www.theguardian.com/technology/2023/nov/06/openai-chatgpt-customers-copyright-lawsuits [3] https://www.washingtonpost.com/technology/2023/11/06/openai-app-store-chat-gptstore/ [4] https://www.wired.com/story/openai-wants-everyone-to-build-their-own-version-of-chatgpt/ submitted by /u/Excellent-Target-847 [link] [comments]
    Whatsapp @meta AI was just released, aaand, apparently Zuck did not call anyone Dumb F__ks!
    ​ looks like the 'hand of god' was cooking the books here. https://preview.redd.it/p64tjmcyouyb1.png?width=543&format=png&auto=webp&s=627b5b7f05311a5fd0687fee20ff990b6c681a6c submitted by /u/jazz788 [link] [comments]
  • Open

    [Project] I recently completed this autonomous racing car project using DRL and mlagents toolkit.
    ​ You can checkout the demo or play the game here: https://sookeyy.itch.io/neuralnitro submitted by /u/Sookeyy [link] [comments]
    Model-based methods that don't learn Gaussians?
    I've come across a few model-based methods in continuous state spaces, and the model is always a Gaussian. (In many cases, the environment itself is actually deterministic, but that's a story for another day.) Are there significant papers trying to make more powerful models work? Are there even problem settings where this is useful? I'd assume a decent starting point for modeling more complicated transitions is to use a noise-conditioned network, like in distributional RL. Maybe people use mixtures of Gaussians, but I don't find that particularly satisfying. submitted by /u/_An_Other_Account_ [link] [comments]
    What is the difference between the RL environments with or without the terminated function?
    Hi! Recently, I read some examples of gymnasium environments and found that the terminated function is not defined in all environments. According to my understanding, the terminated function is used to end the current episode when the goal has been achieved. In environments without the terminated function, will the episode continue to roll even though the goal has been achieved? Besides, some environments seem to omit the definition of the terminated function entirely. Thus, I'm confused about the necessity of this function/property. I hope to get a detailed explanation, which would be greatly appreciated! submitted by /u/UpperSearch4172 [link] [comments]
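    For context, Gymnasium's step call returns five values: observation, reward, terminated, truncated, info. The toy environment below is a hypothetical sketch that mirrors that convention without importing the library; the class, its goal and its step limit are made up for the example. It shows the usual split: `terminated` marks a true terminal state such as reaching the goal, while `truncated` marks an episode cut off from outside, here by a step limit.

```python
class CorridorEnv:
    """Toy 1D corridor that follows the five-value step convention:
    (observation, reward, terminated, truncated, info)."""

    def __init__(self, goal=3, max_steps=5):
        self.goal = goal
        self.max_steps = max_steps
        self.position = 0
        self.steps = 0

    def reset(self):
        self.position, self.steps = 0, 0
        return self.position, {}

    def step(self, action):
        # action is +1 (move right) or 0 (stay in place)
        self.position += action
        self.steps += 1
        # Goal reached: the task itself is over.
        terminated = self.position >= self.goal
        # Step limit hit without reaching the goal: the episode is cut short.
        truncated = not terminated and self.steps >= self.max_steps
        reward = 1.0 if terminated else 0.0
        return self.position, reward, terminated, truncated, {}

def run_episode(env, policy):
    """Roll one episode and report how it ended."""
    obs, _ = env.reset()
    while True:
        obs, reward, terminated, truncated, _ = env.step(policy(obs))
        if terminated or truncated:
            return obs, terminated, truncated

print(run_episode(CorridorEnv(), lambda obs: 1))  # always move right
print(run_episode(CorridorEnv(), lambda obs: 0))  # never move
```

    An environment that never reports `terminated` behaves like a continuing task: the rollout only stops when something outside the task, typically a time limit, truncates it. The distinction matters mostly for learning code, which commonly bootstraps the value of the last state after a truncation but not after a termination.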
    After David Silver's course
    Can I just dive into reading papers, without Berkeley's DRL course? submitted by /u/RealJuney [link] [comments]
  • Open

    Alternating updates for efficient transformers
    Posted by Xin Wang, Software Engineer, and Nishanth Dikkala, Research Scientist, Google Research Contemporary deep learning models have been remarkably successful in many domains, ranging from natural language to computer vision. Transformer neural networks (transformers) are a popular deep learning architecture that today comprise the foundation for most tasks in natural language processing and also are starting to extend to applications in other domains, such as computer vision, robotics, and autonomous driving. Moreover, they form the backbone of all the current state-of-the-art language models. Increasing scale in Transformer networks has led to improved performance and the emergence of behavior not present in smaller networks. However, this increase in scale often comes with pr…  ( 93 min )
  • Open

    Harnessing the power of enterprise data with generative AI: Insights from Amazon Kendra, LangChain, and large language models
    Large language models (LLMs) with their broad knowledge, can generate human-like text on almost any topic. However, their training on massive datasets also limits their usefulness for specialized tasks. Without continued learning, these models remain oblivious to new data and trends that emerge after their initial training. Furthermore, the cost to train new LLMs can […]  ( 14 min )
  • Open

    Toward developing faster algorithms for minimizing submodular functions
    This research paper was presented at the 64th IEEE Symposium on Foundations of Computer Science (FOCS) 2023 (opens in new tab), a premier forum for the latest research in theoretical computer science. Submodular functions are versatile mathematical tools, finding diverse applications in real-world scenarios and guiding solutions across complex domains. From dissecting the intricate networks […] The post Toward developing faster algorithms for minimizing submodular functions appeared first on Microsoft Research.  ( 10 min )

    Digital Artist Steven Tung Shows Off So-fish-ticated Style This Week ‘In the NVIDIA Studio’
    Taiwanese artist Steven Tung creates captivating 2D and 3D digital art that explores sci-fi, minimalism and realism and pushes artistic boundaries.  ( 6 min )

    Scalable Transformer for PDE Surrogate Modeling. (arXiv:2305.17560v2 [cs.LG] UPDATED)
    Transformers have shown state-of-the-art performance on various applications and have recently emerged as a promising tool for surrogate modeling of partial differential equations (PDEs). Despite the introduction of linear-complexity attention, applying Transformers to problems with a large number of grid points can be numerically unstable and computationally expensive. In this work, we propose Factorized Transformer (FactFormer), which is based on an axial factorized kernel integral. Concretely, we introduce a learnable projection operator that decomposes the input function into multiple sub-functions with one-dimensional domains. These sub-functions are then evaluated and used to compute the instance-based kernel with an axial factorized scheme. We show that the proposed model is able to simulate 2D Kolmogorov flow on a $256\times 256$ grid and 3D smoke buoyancy on a $64\times64\times64$ grid with good accuracy and efficiency. The proposed factorized scheme can serve as a computationally efficient low-rank surrogate for the full attention scheme when dealing with multi-dimensional problems.  ( 2 min )
    Graph Neural Networks with polynomial activations have limited expressivity. (arXiv:2310.13139v2 [cs.LG] UPDATED)
    The expressivity of Graph Neural Networks (GNNs) can be entirely characterized by appropriate fragments of first-order logic. Namely, any query of the two-variable fragment of graded modal logic (GC2) interpreted over labeled graphs can be expressed using a GNN whose size depends only on the depth of the query. As pointed out by [Barcelo et al., 2020; Grohe, 2021], this description holds for a family of activation functions, leaving the possibility for a hierarchy of logics expressible by GNNs depending on the chosen activation function. In this article, we show that such a hierarchy indeed exists by proving that GC2 queries cannot be expressed by GNNs with polynomial activation functions. This implies a separation between polynomial and popular non-polynomial activations (such as ReLU, sigmoid, hyperbolic tangent, and others) and answers an open question formulated by [Grohe, 2021].  ( 2 min )
    Separable PINN: Mitigating the Curse of Dimensionality in Physics-Informed Neural Networks. (arXiv:2211.08761v3 [cs.LG] UPDATED)
    Physics-informed neural networks (PINNs) have emerged as new data-driven PDE solvers for both forward and inverse problems. While promising, the expensive computational costs to obtain solutions often restrict their broader applicability. We demonstrate that the computations in automatic differentiation (AD) can be significantly reduced by leveraging forward-mode AD when training PINNs. However, a naive application of forward-mode AD to conventional PINNs results in higher computation, losing its practical benefit. Therefore, we propose a network architecture, called separable PINN (SPINN), which can facilitate forward-mode AD for more efficient computation. SPINN operates on a per-axis basis instead of point-wise processing in conventional PINNs, decreasing the number of network forward passes. Besides, while the computation and memory costs of standard PINNs grow exponentially along with the grid resolution, those of our model are remarkably less susceptible, mitigating the curse of dimensionality. We demonstrate the effectiveness of our model in various PDE systems by significantly reducing the training run-time while achieving comparable accuracy. Project page: https://jwcho5576.github.io/spinn/  ( 2 min )
    Prompt Engineering Through the Lens of Optimal Control. (arXiv:2310.14201v2 [cs.LG] UPDATED)
    Prompt Engineering (PE) has emerged as a critical technique for guiding Large Language Models (LLMs) in solving intricate tasks. Its importance is highlighted by its potential to significantly enhance the efficiency and effectiveness of human-machine interaction. As tasks grow increasingly complex, recent advanced PE methods have extended beyond the limitations of single-round interactions to embrace multi-round interactions, which allows for a deeper and more nuanced engagement with LLMs. In this paper, we propose an optimal control framework tailored for multi-round interactions with LLMs. This framework provides a unified mathematical structure that not only systematizes the existing PE methods but also sets the stage for rigorous analytical improvements. Furthermore, we extend this framework to include PE via ensemble methods and multi-agent collaboration, thereby enlarging the scope of applicability. By adopting an optimal control perspective, we offer fresh insights into existing PE methods and highlight theoretical challenges that warrant future research. Besides, our work lays a foundation for the development of more effective and interpretable PE methods.  ( 2 min )
    Flover: A Temporal Fusion Framework for Efficient Autoregressive Model Parallel Inference. (arXiv:2305.13484v3 [cs.DC] UPDATED)
    Autoregressive models, despite their commendable performance in a myriad of generative tasks, face challenges stemming from their inherently sequential structure. Inference on these models, by design, harnesses a temporal dependency, where the current token's probability distribution is conditioned on preceding tokens. This inherent characteristic severely impedes computational efficiency during inference, as a typical inference request can require thousands of tokens, and generating each token requires loading the entire model weights, making inference memory-bound. The large overhead becomes profound in real deployment where requests arrive randomly, necessitating various generation lengths. Existing solutions, such as dynamic batching and concurrent instances, introduce significant response delays and bandwidth contention, falling short of achieving optimal latency and throughput. To address these shortcomings, we propose Flover -- a temporal fusion framework for efficiently inferring multiple requests in parallel. We deconstruct the general generation pipeline into pre-processing and token generation, and equip the framework with a dedicated work scheduler for fusing the generation process temporally across all requests. By orchestrating the token-level parallelism, Flover exhibits optimal hardware efficiency and significantly spares the system resources. By further employing a fast buffer reordering algorithm that allows memory eviction of finished tasks, it brings over 11x inference speedup on GPT and 16x on LLAMA compared to the cutting-edge solutions provided by NVIDIA FasterTransformer. Crucially, by leveraging the advanced tensor parallel technique, Flover proves efficacious across diverse computational landscapes, from single-GPU setups to distributed scenarios, thereby offering robust performance optimization that adapts to variable use cases.  ( 3 min )
    Bayes beats Cross Validation: Efficient and Accurate Ridge Regression via Expectation Maximization. (arXiv:2310.18860v2 [stat.ML] UPDATED)
    We present a novel method for tuning the regularization hyper-parameter, $\lambda$, of a ridge regression that is faster to compute than leave-one-out cross-validation (LOOCV) while yielding estimates of the regression parameters of equal, or particularly in the setting of sparse covariates, superior quality to those obtained by minimising the LOOCV risk. The LOOCV risk can suffer from multiple and bad local minima for finite $n$ and thus requires the specification of a set of candidate $\lambda$, which can fail to provide good solutions. In contrast, we show that the proposed method is guaranteed to find a unique optimal solution for large enough $n$, under relatively mild conditions, without requiring the specification of any difficult to determine hyper-parameters. This is based on a Bayesian formulation of ridge regression that we prove to have a unimodal posterior for large enough $n$, allowing for both the optimal $\lambda$ and the regression coefficients to be jointly learned within an iterative expectation maximization (EM) procedure. Importantly, we show that by utilizing an appropriate preprocessing step, a single iteration of the main EM loop can be implemented in $O(\min(n, p))$ operations, for input data with $n$ rows and $p$ columns. In contrast, evaluating a single value of $\lambda$ using fast LOOCV costs $O(n \min(n, p))$ operations when using the same preprocessing. This advantage amounts to an asymptotic improvement of a factor of $l$ for $l$ candidate values for $\lambda$ (in the regime $q, p \in O(\sqrt{n})$ where $q$ is the number of regression targets).  ( 3 min )
    Modeling Dynamics over Meshes with Gauge Equivariant Nonlinear Message Passing. (arXiv:2310.19589v2 [cs.LG] UPDATED)
    Data over non-Euclidean manifolds, often discretized as surface meshes, naturally arise in computer graphics and biological and physical systems. In particular, solutions to partial differential equations (PDEs) over manifolds depend critically on the underlying geometry. While graph neural networks have been successfully applied to PDEs, they do not incorporate surface geometry and do not consider local gauge symmetries of the manifold. Alternatively, recent works on gauge equivariant convolutional and attentional architectures on meshes leverage the underlying geometry but underperform in modeling surface PDEs with complex nonlinear dynamics. To address these issues, we introduce a new gauge equivariant architecture using nonlinear message passing. Our novel architecture achieves higher performance than either convolutional or attentional networks on domains with highly complex and nonlinear dynamics. However, similar to the non-mesh case, design trade-offs favor convolutional, attentional, or message passing networks for different tasks; we investigate in which circumstances our message passing method provides the most benefit.  ( 2 min )
    Finite-Time Logarithmic Bayes Regret Upper Bounds. (arXiv:2306.09136v2 [cs.LG] UPDATED)
    We derive the first finite-time logarithmic Bayes regret upper bounds for Bayesian bandits. In Gaussian bandits, we obtain $O(c_\Delta \log n)$ and $O(c_h \log^2 n)$ bounds for an upper confidence bound algorithm, where $c_h$ and $c_\Delta$ are constants depending on the prior distribution and the gaps of random bandit instances sampled from it, respectively. The latter bound asymptotically matches the lower bound of Lai (1987). Our proofs are a major technical departure from prior works, while being simple and general. To show the generality of our techniques, we apply them to linear bandits. Our results provide insights on the value of prior in the Bayesian setting, both in the objective and as a side information given to the learner. They significantly improve upon existing $\tilde{O}(\sqrt{n})$ bounds, which have become standard in the literature despite the existing lower bounds.  ( 2 min )
    Recognition of Unseen Bird Species by Learning from Field Guides. (arXiv:2206.01466v2 [cs.CV] UPDATED)
    We exploit field guides to learn bird species recognition, in particular zero-shot recognition of unseen species. Illustrations contained in field guides deliberately focus on discriminative properties of each species, and can serve as side information to transfer knowledge from seen to unseen bird species. We study two approaches: (1) a contrastive encoding of illustrations, which can be fed into standard zero-shot learning schemes; and (2) a novel method that leverages the fact that illustrations are also images and as such structurally more similar to photographs than other kinds of side information. Our results show that illustrations from field guides, which are readily available for a wide range of species, are indeed a competitive source of side information for zero-shot learning. On a subset of the iNaturalist2021 dataset with 749 seen and 739 unseen species, we obtain a classification accuracy of unseen bird species of $12\%$ @top-1 and $38\%$ @top-10, which shows the potential of field guides for challenging real-world scenarios with many species. Our code is available at https://github.com/ac-rodriguez/zsl_billow  ( 2 min )
    General Purpose Artificial Intelligence Systems (GPAIS): Properties, Definition, Taxonomy, Societal Implications and Responsible Governance. (arXiv:2307.14283v2 [cs.AI] UPDATED)
    Most applications of Artificial Intelligence (AI) are designed for a confined and specific task. However, there are many scenarios that call for a more general AI, capable of solving a wide array of tasks without being specifically designed for them. The term General-Purpose Artificial Intelligence Systems (GPAIS) has been defined to refer to these AI systems. To date, the possibility of an Artificial General Intelligence, powerful enough to perform any intellectual task as if it were human, or even improve it, has remained an aspiration, fiction, and considered a risk for our society. Whilst we might still be far from achieving that, GPAIS is a reality and sitting at the forefront of AI research. This work discusses existing definitions for GPAIS and proposes a new definition that allows for a gradual differentiation among types of GPAIS according to their properties and limitations. We distinguish between closed-world and open-world GPAIS, characterising their degree of autonomy and ability based on several factors such as adaptation to new tasks, competence in domains not intentionally trained for, ability to learn from few data, or proactive acknowledgment of their own limitations. We propose a taxonomy of approaches to realise GPAIS, describing research trends such as the use of AI techniques to improve another AI (AI-powered AI) or (single) foundation models. As a prime example, we delve into GenAI, aligning them with the concepts presented in the taxonomy. We explore multi-modality, which involves fusing various types of data sources to expand the capabilities of GPAIS. Through the proposed definition and taxonomy, our aim is to facilitate research collaboration across different areas that are tackling general purpose tasks, as they share many common aspects. Finally, we discuss the state of GPAIS, prospects, societal implications, and the need for regulation and governance.  ( 3 min )
    Numerical influence of ReLU'(0) on backpropagation. (arXiv:2106.12915v4 [cs.LG] UPDATED)
    In theory, the choice of ReLU'(0) in [0, 1] for a neural network has a negligible influence both on backpropagation and training. Yet, in the real world, the default 32-bit precision combined with the size of deep learning problems makes it a hyperparameter of training methods. We investigate the importance of the value of ReLU'(0) for several precision levels (16, 32, 64 bits), on various networks (fully connected, VGG, ResNet) and datasets (MNIST, CIFAR10, SVHN, ImageNet). We observe considerable variations of backpropagation outputs, which occur around half of the time in 32-bit precision. The effect disappears with double precision, while it is systematic at 16 bits. For vanilla SGD training, the choice ReLU'(0) = 0 seems to be the most efficient. For our experiments on ImageNet, the gain in test accuracy over ReLU'(0) = 1 was more than 10 points (two runs). We also show that reconditioning approaches such as batch-norm or ADAM tend to buffer the influence of ReLU'(0)'s value. Overall, the message we convey is that algorithmic differentiation of nonsmooth problems potentially hides parameters that could be tuned advantageously.  ( 2 min )
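    To make the role of ReLU'(0) concrete, here is a minimal pure-Python sketch (not the paper's code). The one-weight model, the squared loss, and the names `relu_grad`, `loss_grad_wrt_w`, and `s` are illustrative assumptions; `s` stands for the value assigned to the derivative at exactly zero.

```python
def relu(x):
    """Standard ReLU forward pass."""
    return x if x > 0.0 else 0.0


def relu_grad(x, s):
    """Derivative of ReLU, where s in [0, 1] is the value chosen at x == 0."""
    if x > 0.0:
        return 1.0
    if x < 0.0:
        return 0.0
    return s  # the only point where the choice matters


def loss_grad_wrt_w(w, x, target, s):
    """Backpropagated d/dw of 0.5 * (relu(w * x) - target)**2 (toy model, assumed)."""
    z = w * x
    return (relu(z) - target) * relu_grad(z, s) * x


# With w = 0 the pre-activation is exactly zero, so the gradient depends on s.
grad_s0 = loss_grad_wrt_w(w=0.0, x=2.0, target=1.0, s=0.0)
grad_s1 = loss_grad_wrt_w(w=0.0, x=2.0, target=1.0, s=1.0)
print(grad_s0, grad_s1)
```

    With s = 0 the weight receives no update at this point, while with s = 1 it does. The sketch only shows where the choice enters backpropagation; it does not reproduce the precision effects the abstract reports.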
    Learning COVID-19 Regional Transmission Using Universal Differential Equations in a SIR model. (arXiv:2310.16804v2 [cs.LG] UPDATED)
    Highly interconnected societies make it difficult to model the spread of infectious diseases such as COVID-19. Single-region SIR models fail to account for incoming forces of infection, and expanding them to a large number of interacting regions involves many assumptions that do not hold in the real world. We propose using Universal Differential Equations (UDEs) to capture the influence of neighboring regions and improve the model's predictions in a combined SIR+UDE model. UDEs are differential equations totally or partially defined by a deep neural network (DNN). We include an additive term in the SIR equations, composed of a DNN that learns the incoming force of infection from the other regions. The learning is performed using automatic differentiation and gradient descent to approach the change in the target system caused by the state of the neighboring regions. We compared the proposed model using a simulated COVID-19 outbreak against a single-region SIR and a fully data-driven model composed only of a DNN. The proposed UDE+SIR model generates predictions that capture the outbreak dynamics more accurately, but a decay in performance is observed at the last stages of the outbreak. The single-area SIR and the fully data-driven approach do not capture the proper dynamics accurately. Once the predictions were obtained, we employed the SINDy algorithm to substitute the DNN with a regression, removing the black box element of the model with no considerable increase in the error levels.  ( 3 min )
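    The additive-term idea can be sketched as a single forward-Euler step of a SIR model. This is an illustration under stated assumptions, not the authors' implementation: the abstract does not say which equations receive the term, so here the hypothetical `inflow` callable (a stand-in for the trained DNN) is assumed to add new infections that move individuals from S to I.

```python
def sir_step(s, i, r, beta, gamma, dt, inflow=None):
    """One forward-Euler step of SIR with an optional additive infection term.

    inflow(s, i, r) is a hypothetical stand-in for the learned DNN term; it is
    assumed here to return extra new infections per unit time.
    """
    n = s + i + r
    extra = inflow(s, i, r) if inflow is not None else 0.0
    new_infections = beta * s * i / n + extra
    recoveries = gamma * i
    ds = -new_infections
    di = new_infections - recoveries
    dr = recoveries
    return s + dt * ds, i + dt * di, r + dt * dr


# Single-region SIR versus the same step with a constant assumed inflow.
plain = sir_step(990.0, 10.0, 0.0, beta=0.3, gamma=0.1, dt=1.0)
coupled = sir_step(990.0, 10.0, 0.0, beta=0.3, gamma=0.1, dt=1.0,
                   inflow=lambda s, i, r: 5.0)
print(plain, coupled)
```

    In a real UDE the constant lambda would be replaced by a network whose parameters are fitted by differentiating through the solver; the step above only shows where such a term would plug in.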
    Learning Extrinsic Dexterity with Parameterized Manipulation Primitives. (arXiv:2310.17785v2 [cs.RO] UPDATED)
    Many practically relevant robot grasping problems feature a target object for which all grasps are occluded, e.g., by the environment. Single-shot grasp planning invariably fails in such scenarios. Instead, it is necessary to first manipulate the object into a configuration that affords a grasp. We solve this problem by learning a sequence of actions that utilize the environment to change the object's pose. Concretely, we employ hierarchical reinforcement learning to combine a sequence of learned parameterized manipulation primitives. By learning the low-level manipulation policies, our approach can control the object's state through exploiting interactions between the object, the gripper, and the environment. Designing such a complex behavior analytically would be infeasible under uncontrolled conditions, as an analytic approach requires accurate physical modeling of the interaction and contact dynamics. In contrast, we learn a hierarchical policy model that operates directly on depth perception data, without the need for object detection, pose estimation, or manual design of controllers. We evaluate our approach on picking box-shaped objects of various weight, shape, and friction properties from a constrained table-top workspace. Our method transfers to a real robot and is able to successfully complete the object picking task in 98\% of experimental trials.  ( 2 min )
    On minimizers and convolutional filters: theoretical connections and applications to genome analysis. (arXiv:2111.08452v5 [cs.LG] UPDATED)
    Minimizers and convolutional neural networks (CNNs) are two quite distinct popular techniques that have both been employed to analyze categorical biological sequences. At face value, the methods seem entirely dissimilar. Minimizers use min-wise hashing on a rolling window to extract a single important k-mer feature per window. CNNs start with a wide array of randomly initialized convolutional filters, paired with a pooling operation, and then multiple additional neural layers to learn both the filters themselves and how they can be used to classify the sequence. Here, our main result is a careful mathematical analysis of hash function properties showing that for sequences over a categorical alphabet, random Gaussian initialization of convolutional filters with max-pooling is equivalent to choosing a minimizer ordering such that selected k-mers are (in Hamming distance) far from the k-mers within the sequence but close to other minimizers. In empirical experiments, we find that this property manifests as decreased density in repetitive regions, both in simulation and on real human telomeres. We additionally train from scratch a CNN embedding of synthetic short-reads from the SARS-CoV-2 genome into 3D Euclidean space that locally recapitulates the linear sequence distance of the read origins, a modest step towards building a deep learning assembler, though it is at present too slow to be practical. In total, this manuscript provides a partial explanation for the effectiveness of CNNs in categorical sequence analysis.  ( 3 min )
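    The minimizer scheme mentioned in the abstract (one k-mer kept per rolling window) can be sketched as follows. This is a minimal illustration, not the paper's code: the function name `minimizers`, the `order` parameter, and the default lexicographic ordering are assumptions made for readability; practical tools typically use a hash function as the ordering.

```python
def minimizers(seq, k, w, order=None):
    """Return sorted, de-duplicated (position, k-mer) pairs.

    Each window of w consecutive k-mers contributes its smallest k-mer under
    `order`; ties are broken by the leftmost position. The default ordering
    is lexicographic (an assumption made for this sketch).
    """
    if order is None:
        order = lambda kmer: kmer
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    selected = set()
    for start in range(len(kmers) - w + 1):
        window = range(start, start + w)
        best = min(window, key=lambda i: (order(kmers[i]), i))
        selected.add((best, kmers[best]))
    return sorted(selected)


picked = minimizers("ACGTACGT", k=3, w=2)
print(picked)
```

    Passing a different `order` (for example, a keyed hash of the k-mer) changes which k-mers are selected, which is the degree of freedom the abstract relates to convolutional filters.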
    Detecting Pretraining Data from Large Language Models. (arXiv:2310.16789v2 [cs.CL] UPDATED)
    Although large language models (LLMs) are widely deployed, the data used to train them is rarely disclosed. Given the incredible scale of this data, up to trillions of tokens, it is all but certain that it includes potentially problematic text such as copyrighted materials, personally identifiable information, and test data for widely reported reference benchmarks. However, we currently have no way to know which data of these types is included or in what proportions. In this paper, we study the pretraining data detection problem: given a piece of text and black-box access to an LLM without knowing the pretraining data, can we determine if the model was trained on the provided text? To facilitate this study, we introduce a dynamic benchmark WIKIMIA that uses data created before and after model training to support gold truth detection. We also introduce a new detection method Min-K% Prob based on a simple hypothesis: an unseen example is likely to contain a few outlier words with low probabilities under the LLM, while a seen example is less likely to have words with such low probabilities. Min-K% Prob can be applied without any knowledge about the pretraining corpus or any additional training, departing from previous detection methods that require training a reference model on data that is similar to the pretraining data. Moreover, our experiments demonstrate that Min-K% Prob achieves a 7.4% improvement on WIKIMIA over these previous methods. We apply Min-K% Prob to three real-world scenarios, copyrighted book detection, contaminated downstream example detection and privacy auditing of machine unlearning, and find it a consistently effective solution.  ( 3 min )
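    The hypothesis behind Min-K% Prob can be sketched with a few lines of Python. The abstract does not give the exact formula, so this assumes the score is the mean log-probability of the k% lowest-probability tokens; the function name, the default k, and the toy log-probabilities are illustrative, and in practice the per-token log-probabilities would come from the LLM being audited.

```python
import math


def min_k_percent_score(token_logprobs, k=20.0):
    """Mean log-probability of the k% least likely tokens (assumed definition)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    n = max(1, math.ceil(len(token_logprobs) * k / 100.0))
    lowest = sorted(token_logprobs)[:n]
    return sum(lowest) / n


# Toy inputs: the second text contains one outlier token with very low probability.
likely_seen = [-0.1, -0.2, -0.3, -0.1, -0.2]
likely_unseen = [-0.1, -0.2, -9.0, -0.1, -0.2]
score_seen = min_k_percent_score(likely_seen, k=20.0)
score_unseen = min_k_percent_score(likely_unseen, k=20.0)
print(score_seen, score_unseen)
```

    A higher score suggests the text is more likely to have been seen during pretraining. Any decision threshold would have to be calibrated on held-out data and is not specified here.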
    The language of prompting: What linguistic properties make a prompt successful? (arXiv:2311.01967v1 [cs.CL])
    The latest generation of LLMs can be prompted to achieve impressive zero-shot or few-shot performance in many NLP tasks. However, since performance is highly sensitive to the choice of prompts, considerable effort has been devoted to crowd-sourcing prompts or designing methods for prompt optimisation. Yet, we still lack a systematic understanding of how linguistic properties of prompts correlate with task performance. In this work, we investigate how LLMs of different sizes, pre-trained and instruction-tuned, perform on prompts that are semantically equivalent, but vary in linguistic structure. We investigate both grammatical properties such as mood, tense, aspect and modality, as well as lexico-semantic variation through the use of synonyms. Our findings contradict the common assumption that LLMs achieve optimal performance on lower perplexity prompts that reflect language use in pretraining or instruction-tuning data. Prompts transfer poorly between datasets or models, and performance cannot generally be explained by perplexity, word frequency, ambiguity or prompt length. Based on our results, we put forward a proposal for a more robust and comprehensive evaluation standard for prompting research.  ( 2 min )
    Large Language Models Illuminate a Progressive Pathway to Artificial Healthcare Assistant: A Review. (arXiv:2311.01918v1 [cs.CL])
    With the rapid development of artificial intelligence, large language models (LLMs) have shown promising capabilities in mimicking human-level language comprehension and reasoning. This has sparked significant interest in applying LLMs to enhance various aspects of healthcare, ranging from medical education to clinical decision support. However, medicine involves multifaceted data modalities and nuanced reasoning skills, presenting challenges for integrating LLMs. This paper provides a comprehensive review on the applications and implications of LLMs in medicine. It begins by examining the fundamental applications of general-purpose and specialized LLMs, demonstrating their utilities in knowledge retrieval, research support, clinical workflow automation, and diagnostic assistance. Recognizing the inherent multimodality of medicine, the review then focuses on multimodal LLMs, investigating their ability to process diverse data types like medical imaging and EHRs to augment diagnostic accuracy. To address LLMs' limitations regarding personalization and complex clinical reasoning, the paper explores the emerging development of LLM-powered autonomous agents for healthcare. Furthermore, it summarizes the evaluation methodologies for assessing LLMs' reliability and safety in medical contexts. Overall, this review offers an extensive analysis on the transformative potential of LLMs in modern medicine. It also highlights the pivotal need for continuous optimizations and ethical oversight before these models can be effectively integrated into clinical practice. Visit https://github.com/mingze-yuan/Awesome-LLM-Healthcare for an accompanying GitHub repository containing latest papers.  ( 3 min )
    Online non-parametric likelihood-ratio estimation by Pearson-divergence functional minimization. (arXiv:2311.01900v1 [stat.ML])
    Quantifying the difference between two probability density functions, $p$ and $q$, using available data, is a fundamental problem in Statistics and Machine Learning. A usual approach for addressing this problem is the likelihood-ratio estimation (LRE) between $p$ and $q$, which -- to the best of our knowledge -- has been investigated mainly for the offline case. This paper contributes by introducing a new framework for online non-parametric LRE (OLRE) for the setting where pairs of iid observations $(x_t \sim p, x'_t \sim q)$ are observed over time. The non-parametric nature of our approach has the advantage of being agnostic to the forms of $p$ and $q$. Moreover, we capitalize on the recent advances in Kernel Methods and functional minimization to develop an estimator that can be efficiently updated online. We provide theoretical guarantees for the performance of the OLRE method along with empirical validation in synthetic experiments.  ( 2 min )
    Advancing Bayesian Optimization via Learning Correlated Latent Space. (arXiv:2310.20258v2 [cs.LG] UPDATED)
    Bayesian optimization is a powerful method for optimizing black-box functions with limited function evaluations. Recent works have shown that optimization in a latent space through deep generative models such as variational autoencoders leads to effective and efficient Bayesian optimization for structured or discrete data. However, as the optimization does not take place in the input space, it leads to an inherent gap that results in potentially suboptimal solutions. To alleviate the discrepancy, we propose Correlated latent space Bayesian Optimization (CoBO), which focuses on learning correlated latent spaces characterized by a strong correlation between the distances in the latent space and the distances within the objective function. Specifically, our method introduces Lipschitz regularization, loss weighting, and trust region recoordination to minimize the inherent gap around the promising areas. We demonstrate the effectiveness of our approach on several optimization tasks in discrete data, such as molecule design and arithmetic expression fitting, and achieve high performance within a small budget.
    Learning UI-to-Code Reverse Generator Using Visual Critic Without Rendering. (arXiv:2305.14637v2 [cs.CV] UPDATED)
    Automated reverse engineering of HTML/CSS code from UI screenshots is an important yet challenging problem with broad applications in website development and design. In this paper, we propose a novel vision-code transformer (ViCT) composed of a vision encoder processing the screenshots and a language decoder to generate the code. They are initialized by pre-trained models such as ViT/DiT and GPT-2/LLaMA, but aligning the two modalities requires end-to-end finetuning, which aims to minimize the visual discrepancy between the code-rendered webpage and the original screenshot. However, the rendering is non-differentiable and causes costly overhead. We address this problem by actor-critic fine-tuning where a visual critic without rendering (ViCR) is developed to predict visual discrepancy given the original and generated code. To train and evaluate our models, we created two synthetic datasets of varying complexity, with over 75,000 unique (code, screenshot) pairs. We evaluate the UI-to-Code performance using a combination of automated metrics such as MSE, BLEU, IoU, and a novel htmlBLEU score. ViCT outperforms a strong baseline model DiT-GPT2, improving IoU from 0.64 to 0.79 and lowering MSE from 12.25 to 9.02. With much lower computational cost, it can achieve performance comparable to that obtained with a larger decoder such as LLaMA.  ( 2 min )
    Dropout Strategy in Reinforcement Learning: Limiting the Surrogate Objective Variance in Policy Optimization Methods. (arXiv:2310.20380v3 [cs.LG] UPDATED)
    Policy-based reinforcement learning algorithms are widely used in various fields. Among them, mainstream policy optimization algorithms such as TRPO and PPO introduce importance sampling into policy iteration, which allows the reuse of historical data. However, this can also lead to a high variance of the surrogate objective and indirectly affect the stability and convergence of the algorithm. In this paper, we first derive an upper bound of the surrogate objective variance, which can grow quadratically with the increase of the surrogate objective. Next, we propose the dropout technique to avoid the excessive increase of the surrogate objective variance caused by importance sampling. Then, we introduce a general reinforcement learning framework applicable to mainstream policy optimization methods, and apply the dropout technique to the PPO algorithm to obtain the D-PPO variant. Finally, we conduct comparative experiments between the D-PPO and PPO algorithms in the Atari 2600 environment, and the results show that D-PPO achieves significant performance improvements compared to PPO and effectively limits the excessive increase of the surrogate objective variance during training.
    CapsFusion: Rethinking Image-Text Data at Scale. (arXiv:2310.20550v2 [cs.CV] UPDATED)
    Large multimodal models demonstrate remarkable generalist ability to perform diverse multimodal tasks in a zero-shot manner. Large-scale web-based image-text pairs contribute fundamentally to this success, but suffer from excessive noise. Recent studies use alternative captions synthesized by captioning models and have achieved notable benchmark performance. However, our experiments reveal significant Scalability Deficiency and World Knowledge Loss issues in models trained with synthetic captions, which have been largely obscured by their initial benchmark success. Upon closer examination, we identify the root cause as the overly-simplified language structure and lack of knowledge details in existing synthetic captions. To provide higher-quality and more scalable multimodal pretraining data, we propose CapsFusion, an advanced framework that leverages large language models to consolidate and refine information from both web-based image-text pairs and synthetic captions. Extensive experiments show that CapsFusion captions exhibit remarkable all-round superiority over existing captions in terms of model performance (e.g., 18.8 and 18.3 improvements in CIDEr score on COCO and NoCaps), sample efficiency (requiring 11-16 times less computation than baselines), world knowledge depth, and scalability. These effectiveness, efficiency and scalability advantages position CapsFusion as a promising candidate for future scaling of LMM training.
    Ensemble models outperform single model uncertainties and predictions for operator-learning of hypersonic flows. (arXiv:2311.00060v2 [physics.flu-dyn] UPDATED)
    High-fidelity computational simulations and physical experiments of hypersonic flows are resource intensive. Training scientific machine learning (SciML) models on limited high-fidelity data offers one approach to rapidly predict behaviors for situations that have not been seen before. However, high-fidelity data is itself too limited in quantity to validate all outputs of the SciML model in unexplored input space. As such, an uncertainty-aware SciML model is desired. The SciML model's output uncertainties could then be used to assess the reliability and confidence of the model's predictions. In this study, we extend a DeepONet using three different uncertainty quantification mechanisms: mean-variance estimation, evidential uncertainty, and ensembling. The uncertainty-aware DeepONet models are trained and evaluated on the hypersonic flow around a blunt cone object with data generated via computational fluid dynamics over a wide range of Mach numbers and altitudes. We find that ensembling outperforms the other two uncertainty models in terms of minimizing error and calibrating uncertainty in both interpolative and extrapolative regimes.
    Domain Randomization via Entropy Maximization. (arXiv:2311.01885v1 [cs.LG])
    Varying dynamics parameters in simulation is a popular Domain Randomization (DR) approach for overcoming the reality gap in Reinforcement Learning (RL). Nevertheless, DR heavily hinges on the choice of the sampling distribution of the dynamics parameters, since high variability is crucial to regularize the agent's behavior but notoriously leads to overly conservative policies when randomizing excessively. In this paper, we propose a novel approach to address sim-to-real transfer, which automatically shapes dynamics distributions during training in simulation without requiring real-world data. We introduce DOmain RAndomization via Entropy MaximizatiON (DORAEMON), a constrained optimization problem that directly maximizes the entropy of the training distribution while retaining generalization capabilities. In achieving this, DORAEMON gradually increases the diversity of sampled dynamics parameters as long as the probability of success of the current policy is sufficiently high. We empirically validate the consistent benefits of DORAEMON in obtaining highly adaptive and generalizable policies, i.e. solving the task at hand across the widest range of dynamics parameters, as opposed to representative baselines from the DR literature. Notably, we also demonstrate the Sim2Real applicability of DORAEMON through its successful zero-shot transfer in a robotic manipulation setup under unknown real-world parameters.
    Managing AI Risks in an Era of Rapid Progress. (arXiv:2310.17688v1 [cs.CY] CROSS LISTED)
    In this short consensus paper, we outline risks from upcoming, advanced AI systems. We examine large-scale social harms and malicious uses, as well as an irreversible loss of human control over autonomous AI systems. In light of rapid and continuing AI progress, we propose priorities for AI R&D and governance.
    When Do Transformers Shine in RL? Decoupling Memory from Credit Assignment. (arXiv:2307.03864v4 [cs.LG] UPDATED)
    Reinforcement learning (RL) algorithms face two distinct challenges: learning effective representations of past and present observations, and determining how actions influence future returns. Both challenges involve modeling long-term dependencies. The Transformer architecture has been very successful at solving problems that involve long-term dependencies, including in the RL domain. However, the underlying reason for the strong performance of Transformer-based RL methods remains unclear: is it because they learn effective memory, or because they perform effective credit assignment? After introducing formal definitions of memory length and credit assignment length, we design simple configurable tasks to measure these distinct quantities. Our empirical results reveal that Transformers can enhance the memory capability of RL algorithms, scaling up to tasks that require memorizing observations $1500$ steps ago. However, Transformers do not improve long-term credit assignment. In summary, our results provide an explanation for the success of Transformers in RL, while also highlighting an important area for future research and benchmark design. Our code is open-sourced at https://github.com/twni2016/Memory-RL
    LLQL: Logistic Likelihood Q-Learning for Reinforcement Learning. (arXiv:2307.02345v3 [cs.LG] UPDATED)
    Modern reinforcement learning (RL) can be categorized into online and offline variants. As a pivotal aspect of both online and offline RL, current research on the Bellman equation revolves primarily around optimization techniques and performance enhancement rather than exploring the inherent structural properties of the Bellman error, such as its distribution characteristics. This study investigates the distribution of the Bellman approximation error in both online and offline settings through iterative exploration of the Bellman equation. We observed that in both online and offline RL, the Bellman error conforms to a Logistic distribution. Building upon this discovery, this study employs the Logistic maximum likelihood function (LLoss) as an alternative to the commonly used MSE loss, which assumes that Bellman errors follow a normal distribution. We validated our hypotheses through extensive numerical experiments across diverse online and offline environments. In particular, we applied corrections to the loss function across various baseline algorithms and consistently observed that the loss function with Logistic corrections outperformed the MSE counterpart significantly. Additionally, we conducted Kolmogorov-Smirnov tests to confirm the reliability of the Logistic distribution. This study's theoretical and empirical insights provide valuable groundwork for future investigations and enhancements centered on the distribution of Bellman errors.
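    To make the contrast with MSE concrete, here is a minimal sketch of the negative log-likelihood of a zero-mean Logistic distribution applied to a Bellman error. This is the textbook Logistic density, not necessarily the exact LLoss used in the paper; the `scale` parameter and function names are assumptions for illustration.

```python
import math

def logistic_nll(error, scale=1.0):
    """Negative log-likelihood of `error` under a zero-mean Logistic(0, scale).

    Uses the symmetric, numerically stable form
    |z| + 2*log(1 + exp(-|z|)) + log(scale), with z = error / scale.
    """
    z = abs(error) / scale
    return z + 2.0 * math.log1p(math.exp(-z)) + math.log(scale)

def squared_error(error):
    """Squared error, the Gaussian-likelihood counterpart (up to constants)."""
    return error ** 2
```

    For large errors the Logistic loss grows linearly while the squared error grows quadratically, so outlying Bellman errors are penalized less aggressively.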
    Improving Intrinsic Exploration by Creating Stationary Objectives. (arXiv:2310.18144v2 [cs.LG] UPDATED)
    Exploration bonuses in reinforcement learning guide long-horizon exploration by defining custom intrinsic objectives. Count-based methods use the frequency of state visits to derive an exploration bonus. In this paper, we identify that any intrinsic reward function derived from count-based methods is non-stationary and hence induces a difficult objective to optimize for the agent. The key contribution of our work lies in transforming the original non-stationary rewards into stationary rewards through an augmented state representation. For this purpose, we introduce the Stationary Objectives For Exploration (SOFE) framework. SOFE requires identifying sufficient statistics for different exploration bonuses and finding an efficient encoding of these statistics to use as input to a deep network. SOFE is based on proposing state augmentations that expand the state space but hold the promise of simplifying the optimization of the agent's objective. Our experiments show that SOFE improves the agents' performance in challenging exploration problems, including sparse-reward tasks, pixel-based observations, 3D navigation, and procedurally generated environments.
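    The non-stationarity argument can be illustrated with a toy sketch. It assumes a `1/sqrt(N(s))` bonus and augments the state with the full visit-count table; both choices are assumptions for this example (the abstract only says that SOFE looks for sufficient statistics and an efficient encoding of them).

```python
from collections import Counter

def visit(counts, state):
    """Record a visit; return (bonus, augmented_state).

    The bonus 1/sqrt(N(state)) depends on the visit history, so it is
    non-stationary as a function of `state` alone.
    """
    counts[state] += 1
    bonus = 1.0 / (counts[state] ** 0.5)
    augmented = (state, tuple(sorted(counts.items())))
    return bonus, augmented

def bonus_from_augmented(augmented):
    """The same bonus, recovered as a fixed function of the augmented state."""
    state, count_items = augmented
    return 1.0 / (dict(count_items)[state] ** 0.5)
```

    Visiting the same state twice yields two different bonuses, yet each bonus is a deterministic function of its augmented state, which is the sense in which the augmented reward is stationary.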
    CDGraph: Dual Conditional Social Graph Synthesizing via Diffusion Model. (arXiv:2311.01729v1 [cs.SI])
    Social graphs synthesized by generative models are increasingly in demand due to data scarcity and concerns over user privacy. One of the key performance criteria for generating social networks is the fidelity to specified conditionals, such as users with certain membership and financial status. While recent diffusion models have shown remarkable performance in generating images, their effectiveness in synthesizing graphs has not yet been explored in the context of conditional social graphs. In this paper, we propose CDGraph, the first conditional diffusion model for social networks, which trains and synthesizes graphs based on two specified conditions. We propose the co-evolution dependency in the denoising process of CDGraph to capture the mutual dependencies between the dual conditions and further incorporate social homophily and social contagion to preserve the connectivity between nodes while satisfying the specified conditions. Moreover, we introduce a novel classifier loss, which guides the training of the diffusion process through the mutual dependency of dual conditions. We evaluate CDGraph against four existing graph generative methods, i.e., SPECTRE, GSM, EDGE, and DiGress, on four datasets. Our results show that the generated graphs from CDGraph achieve much higher dual-conditional validity and lower discrepancy in various social network metrics than the baselines, thus demonstrating its proficiency in generating dual-conditional social graphs.
    Guiding Language Models of Code with Global Context using Monitors. (arXiv:2306.10763v2 [cs.CL] UPDATED)
    Language models of code (LMs) work well when the surrounding code provides sufficient context. This is not true when it becomes necessary to use types, functionality or APIs defined elsewhere in the repository or a linked library, especially those not seen during training. LMs suffer from limited awareness of such global context and end up hallucinating. Integrated development environments (IDEs) assist developers in understanding repository context using static analysis. We extend this assistance, enjoyed by developers, to LMs. We propose monitor-guided decoding (MGD) where a monitor uses static analysis to guide the decoding. We construct a repository-level dataset PragmaticCode for method-completion in Java and evaluate MGD on it. On models of varying parameter scale, by monitoring for type-consistent object dereferences, MGD consistently improves compilation rates and agreement with ground truth. Further, LMs with fewer parameters, when augmented with MGD, can outperform larger LMs. With MGD, SantaCoder-1.1B achieves better compilation rate and next-identifier match than the much larger text-davinci-003 model. We also conduct a generalizability study to evaluate the ability of MGD to generalize to multiple programming languages (Java, C# and Rust), coding scenarios (e.g., correct number of arguments to method calls), and to enforce richer semantic constraints (e.g., stateful API protocols). Our data and implementation are available at https://github.com/microsoft/monitors4codegen .
    Provably Convergent Data-Driven Convex-Nonconvex Regularization. (arXiv:2310.05812v2 [cs.LG] UPDATED)
    An emerging paradigm for solving inverse problems is the use of deep learning to learn a regularizer from data. This leads to high-quality results, but often at the cost of provable guarantees. In this work, we show how well-posedness and convergent regularization arise within the convex-nonconvex (CNC) framework for inverse problems. We introduce a novel input weakly convex neural network (IWCNN) construction to adapt the method of learned adversarial regularization to the CNC framework. Empirically we show that our method overcomes numerical issues of previous adversarial methods.
    General Anomaly Detection of Underwater Gliders Validated by Large-scale Deployment Datasets. (arXiv:2308.00180v3 [cs.RO] UPDATED)
    Underwater gliders have been widely used in oceanography for a range of applications. However, unpredictable events like shark strikes or remora attachments can lead to abnormal glider behavior or even loss of the instrument. This paper employs an anomaly detection algorithm to assess operational conditions of underwater gliders in the real-world ocean environment. Prompt alerts are provided to glider pilots upon detecting any anomaly, so that they can take control of the glider to prevent further harm. The detection algorithm is applied to multiple datasets collected in real glider deployments led by the University of Georgia's Skidaway Institute of Oceanography (SkIO) and the University of South Florida (USF). To demonstrate the generality of the algorithm, the experimental evaluation is applied to four glider deployment datasets, each highlighting various anomalies happening in different scenes. Specifically, we utilize high-resolution datasets only available post-recovery to perform detailed analysis of the anomaly and compare it with pilot logs. Additionally, we simulate online detection based on the real-time subsets of data transmitted from the glider at surfacing events. While the real-time data may not contain as much information as the post-recovery data, online detection is of great importance as it allows glider pilots to monitor potential abnormal conditions in real time.
    Learning nonparametric latent causal graphs with unknown interventions. (arXiv:2306.02899v2 [stat.ML] UPDATED)
    We establish conditions under which latent causal graphs are nonparametrically identifiable and can be reconstructed from unknown interventions in the latent space. Our primary focus is the identification of the latent structure in measurement models without parametric assumptions such as linearity or Gaussianity. Moreover, we do not assume the number of hidden variables is known, and we show that at most one unknown intervention per hidden variable is needed. This extends a recent line of work on learning causal representations from observations and interventions. The proofs are constructive and introduce two new graphical concepts -- imaginary subsets and isolated edges -- that may be useful in their own right. As a matter of independent interest, the proofs also involve a novel characterization of the limits of edge orientations within the equivalence class of DAGs induced by unknown interventions. These are the first results to characterize the conditions under which causal representations are identifiable without making any parametric assumptions in a general setting with unknown interventions and without faithfulness.
    DeliverAI: Reinforcement Learning Based Distributed Path-Sharing Network for Food Deliveries. (arXiv:2311.02017v1 [cs.LG])
    Delivery of items from the producer to the consumer has experienced significant growth over the past decade and has been greatly fueled by the recent pandemic. Amazon Fresh, Shopify, UberEats, InstaCart, and DoorDash are rapidly growing and share the same business model of consumer items or food delivery. Existing food delivery methods are sub-optimal because each delivery is individually optimized to go directly from the producer to the consumer via the shortest time path. We observe a significant scope for reducing the costs associated with completing deliveries under the current model. We model our food delivery problem as a multi-objective optimization, where consumer satisfaction and delivery costs both need to be optimized. Taking inspiration from the success of ride-sharing in the taxi industry, we propose DeliverAI - a reinforcement learning-based path-sharing algorithm. Unlike previous attempts at path-sharing, DeliverAI can provide real-time, time-efficient decision-making using a reinforcement learning-enabled agent system. Our novel agent interaction scheme leverages path-sharing among deliveries to reduce the total distance traveled while keeping the delivery completion time in check. We generate and test our methodology vigorously on a simulation setup using real data from the city of Chicago. Our results show that DeliverAI can reduce the delivery fleet size by 12%, the distance traveled by 13%, and achieve 50% higher fleet utilization compared to the baselines.
    Anytime-Competitive Reinforcement Learning with Policy Prior. (arXiv:2311.01568v1 [cs.LG])
    This paper studies the problem of Anytime-Competitive Markov Decision Process (A-CMDP). Existing works on Constrained Markov Decision Processes (CMDPs) aim to optimize the expected reward while constraining the expected cost over random dynamics, but the cost in a specific episode can still be unsatisfactorily high. In contrast, the goal of A-CMDP is to optimize the expected reward while guaranteeing a bounded cost in each round of any episode against a policy prior. We propose a new algorithm, called Anytime-Competitive Reinforcement Learning (ACRL), which provably guarantees the anytime cost constraints. The regret analysis shows the policy asymptotically matches the optimal reward achievable under the anytime competitive constraints. Experiments on the application of carbon-intelligent computing verify the reward performance and cost constraint guarantee of ACRL.
    A large-scale and PCR-referenced vocal audio dataset for COVID-19. (arXiv:2212.07738v4 [cs.SD] UPDATED)
    The UK COVID-19 Vocal Audio Dataset is designed for the training and evaluation of machine learning models that classify SARS-CoV-2 infection status or associated respiratory symptoms using vocal audio. The UK Health Security Agency recruited voluntary participants through the national Test and Trace programme and the REACT-1 survey in England from March 2021 to March 2022, during dominant transmission of the Alpha and Delta SARS-CoV-2 variants and some Omicron variant sublineages. Audio recordings of volitional coughs, exhalations, and speech were collected in the 'Speak up to help beat coronavirus' digital survey alongside demographic, self-reported symptom and respiratory condition data, and linked to SARS-CoV-2 test results. The UK COVID-19 Vocal Audio Dataset represents the largest collection of SARS-CoV-2 PCR-referenced audio recordings to date. PCR results were linked to 70,794 of 72,999 participants and 24,155 of 25,776 positive cases. Respiratory symptoms were reported by 45.62% of participants. This dataset has additional potential uses for bioacoustics research, with 11.30% of participants reporting asthma, and 27.20% with linked influenza PCR test results.
    Improving Fairness using Vision-Language Driven Image Augmentation. (arXiv:2311.01573v1 [cs.CV])
    Fairness is crucial when training a deep-learning discriminative model, especially in the facial domain. Models tend to correlate specific characteristics (such as age and skin color) with unrelated attributes (downstream tasks), resulting in biases which do not correspond to reality. It is common knowledge that these correlations are present in the data and are then transferred to the models during training. This paper proposes a method to mitigate these correlations to improve fairness. To do so, we learn interpretable and meaningful paths lying in the semantic space of a pre-trained diffusion model (DiffAE) -- such paths being supervised by contrastive text dipoles. That is, we learn to edit protected characteristics (age and skin color). These paths are then applied to augment images to improve the fairness of a given dataset. We test the proposed method on CelebA-HQ and UTKFace on several downstream tasks with age and skin color as protected characteristics. As a proxy for fairness, we compute the difference in accuracy with respect to the protected characteristics. Quantitative results show how the augmented images help the model improve the overall accuracy, the aforementioned metric, and the disparity of equal opportunity. Code is available at: https://github.com/Moreno98/Vision-Language-Bias-Control.
    Convex and Non-convex Optimization Under Generalized Smoothness. (arXiv:2306.01264v2 [math.OC] UPDATED)
    Classical analysis of convex and non-convex optimization methods often requires the Lipschitzness of the gradient, which limits the analysis to functions bounded by quadratics. Recent work relaxed this requirement to a non-uniform smoothness condition with the Hessian norm bounded by an affine function of the gradient norm, and proved convergence in the non-convex setting via gradient clipping, assuming bounded noise. In this paper, we further generalize this non-uniform smoothness condition and develop a simple, yet powerful analysis technique that bounds the gradients along the trajectory, thereby leading to stronger results for both convex and non-convex optimization problems. In particular, we obtain the classical convergence rates for (stochastic) gradient descent and Nesterov's accelerated gradient method in the convex and/or non-convex setting under this general smoothness condition. The new analysis approach does not require gradient clipping and allows heavy-tailed noise with bounded variance in the stochastic setting.
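    The affine Hessian bound mentioned above is commonly written as `|f''(x)| <= L0 + L1 * |f'(x)|` in one dimension. The sketch below checks it numerically for `f(x) = x**4`, whose gradient is not globally Lipschitz. The constants `L0 = 16`, `L1 = 1` and the test grid are choices made for this example, not values taken from the paper.

```python
def grad(x):
    """f'(x) for f(x) = x**4."""
    return 4.0 * x ** 3

def hess(x):
    """f''(x) for f(x) = x**4."""
    return 12.0 * x ** 2

def satisfies_affine_bound(xs, l0, l1, tol=1e-9):
    """Check |f''(x)| <= l0 + l1 * |f'(x)| at every point in xs."""
    return all(abs(hess(x)) <= l0 + l1 * abs(grad(x)) + tol for x in xs)

# Grid on [-10, 10] with step 0.01.
GRID = [i / 100.0 for i in range(-1000, 1001)]
```

    On this grid the bound holds with `(l0, l1) = (16, 1)` but fails with `l1 = 0`, which corresponds to the classical assumption of a uniformly bounded Hessian.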
    ChatGPT for GTFS: Benchmarking LLMs on GTFS Understanding and Retrieval. (arXiv:2308.02618v2 [cs.IR] UPDATED)
    The General Transit Feed Specification (GTFS) standard for publishing transit data is ubiquitous. Because GTFS is tabular data with information spread across different files, retrieving information from it requires specialized tools or packages. Concurrently, the use of Large Language Models (LLMs) for text and information retrieval is growing. The idea of this research is to see if the current widely adopted LLMs (ChatGPT) are able to understand GTFS and retrieve information from GTFS using natural language instructions without explicitly providing information. In this research, we benchmark OpenAI's GPT-3.5-Turbo and GPT-4 LLMs, which are the backbone of ChatGPT. ChatGPT demonstrates a reasonable understanding of GTFS by answering 59.7% (GPT-3.5-Turbo) and 73.3% (GPT-4) of our multiple-choice questions (MCQ) correctly. Furthermore, we evaluated the LLMs on information extraction tasks using a filtered GTFS feed containing four routes. We found that program synthesis techniques outperformed zero-shot approaches, achieving up to 93% (90%) accuracy for simple queries and 61% (41%) for complex ones using GPT-4 (GPT-3.5-Turbo).
    Detecting and Mitigating Mode-Collapse for Flow-based Sampling of Lattice Field Theories. (arXiv:2302.14082v2 [hep-lat] UPDATED)
    We study the consequences of mode-collapse of normalizing flows in the context of lattice field theory. Normalizing flows allow for independent sampling. For this reason, it is hoped that they can avoid the tunneling problem of local-update MCMC algorithms for multi-modal distributions. In this work, we first point out that the tunneling problem is also present for normalizing flows but is shifted from the sampling to the training phase of the algorithm. Specifically, normalizing flows often suffer from mode-collapse for which the training process assigns vanishingly low probability mass to relevant modes of the physical distribution. This may result in a significant bias when the flow is used as a sampler in a Markov-Chain or with Importance Sampling. We propose a metric to quantify the degree of mode-collapse and derive a bound on the resulting bias. Furthermore, we propose various mitigation strategies in particular in the context of estimating thermodynamic observables, such as the free energy.
    Proportional Response: Contextual Bandits for Simple and Cumulative Regret Minimization. (arXiv:2307.02108v3 [cs.LG] UPDATED)
    In many applications, e.g. in healthcare and e-commerce, the goal of a contextual bandit may be to learn an optimal treatment assignment policy at the end of the experiment. That is, to minimize simple regret. However, this objective remains understudied. We propose a new family of computationally efficient bandit algorithms for the stochastic contextual bandit setting, where a tuning parameter determines the weight placed on cumulative regret minimization (where we establish near-optimal minimax guarantees) versus simple regret minimization (where we establish state-of-the-art guarantees). Our algorithms work with any function class, are robust to model misspecification, and can be used in continuous arm settings. This flexibility comes from constructing and relying on "conformal arm sets" (CASs). CASs provide a set of arms for every context, encompassing the context-specific optimal arm with a certain probability across the context distribution. Our positive results on simple and cumulative regret guarantees are contrasted with a negative result, which shows that no algorithm can achieve instance-dependent simple regret guarantees while simultaneously achieving minimax optimal cumulative regret guarantees.
    Latent Diffusion Model for Conditional Reservoir Facies Generation. (arXiv:2311.01968v1 [physics.geo-ph])
    Creating accurate and geologically realistic reservoir facies based on limited measurements is crucial for field development and reservoir management, especially in the oil and gas sector. Traditional two-point geostatistics, while foundational, often struggle to capture complex geological patterns. Multi-point statistics offers more flexibility, but comes with its own challenges. With the rise of Generative Adversarial Networks (GANs) and their success in various fields, there has been a shift towards using them for facies generation. However, recent advances in the computer vision domain have shown the superiority of diffusion models over GANs. Motivated by this, a novel Latent Diffusion Model is proposed, which is specifically designed for conditional generation of reservoir facies. The proposed model produces high-fidelity facies realizations that rigorously preserve conditioning data. It significantly outperforms a GAN-based alternative.
    A Unified Approach for Maximizing Continuous DR-submodular Functions. (arXiv:2305.16671v2 [cs.LG] UPDATED)
    This paper presents a unified approach for maximizing continuous DR-submodular functions that encompasses a range of settings and oracle access types. Our approach includes a Frank-Wolfe type offline algorithm for both monotone and non-monotone functions, with different restrictions on the general convex set. We consider settings where the oracle provides access to either the gradient of the function or only the function value, and where the oracle access is either deterministic or stochastic. We determine the number of required oracle accesses in all cases. Our approach gives new/improved results for nine out of the sixteen considered cases, avoids computationally expensive projections in two cases, with the proposed framework matching performance of state-of-the-art approaches in the remaining five cases. Notably, our approach for the stochastic function value-based oracle enables the first regret bounds with bandit feedback for stochastic DR-submodular functions.
    Multi-Task Learning to Enhance Generalizability of Neural Network Equalizers in Coherent Optical Systems. (arXiv:2307.05374v3 [eess.SP] UPDATED)
    For the first time, multi-task learning is proposed to improve the flexibility of NN-based equalizers in coherent systems. A "single" NN-based equalizer improves Q-factor by up to 4 dB compared to CDC, without re-training, even with variations in launch power, symbol rate, or transmission distance.
    Transport, Variational Inference and Diffusions: with Applications to Annealed Flows and Schrödinger Bridges. (arXiv:2307.01050v3 [stat.ML] UPDATED)
    This paper explores the connections between optimal transport and variational inference, with a focus on forward and reverse time stochastic differential equations and Girsanov transformations. We present a principled and systematic framework for sampling and generative modelling centred around divergences on path space. Our work culminates in the development of a novel score-based annealed flow technique (with connections to Jarzynski and Crooks identities from statistical physics) and a regularised iterative proportional fitting (IPF)-type objective, departing from the sequential nature of standard IPF. Through a series of generative modelling examples and a double-well-based rare event task, we showcase the potential of the proposed methods.
    Long Sequence Hopfield Memory. (arXiv:2306.04532v2 [cs.NE] CROSS LISTED)
    Sequence memory is an essential attribute of natural and artificial intelligence that enables agents to encode, store, and retrieve complex sequences of stimuli and actions. Computational models of sequence memory have been proposed where recurrent Hopfield-like neural networks are trained with temporally asymmetric Hebbian rules. However, these networks suffer from limited sequence capacity (maximal length of the stored sequence) due to interference between the memories. Inspired by recent work on Dense Associative Memories, we expand the sequence capacity of these models by introducing a nonlinear interaction term, enhancing separation between the patterns. We derive novel scaling laws for sequence capacity with respect to network size, significantly outperforming existing scaling laws for models based on traditional Hopfield networks, and verify these theoretical results with numerical simulation. Moreover, we introduce a generalized pseudoinverse rule to recall sequences of highly correlated patterns. Finally, we extend this model to store sequences with variable timing between states' transitions and describe a biologically-plausible implementation, with connections to motor neuroscience.
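    A minimal sketch of sequence recall with a temporally asymmetric Hebbian rule may help here. With `degree=1` the update is the classical `x <- sign(W x)` with `W = (1/N) * sum_mu xi^(mu+1) xi^(mu)^T`. The odd polynomial power applied to the overlaps is one possible nonlinear interaction term in the spirit of Dense Associative Memories; it and all names below are assumptions for illustration, not the paper's exact model.

```python
import random

def sign(v):
    """Sign function with ties broken towards +1."""
    return 1 if v >= 0 else -1

def make_patterns(num_patterns, num_neurons, seed=0):
    """Random +/-1 patterns forming the sequence to store."""
    rng = random.Random(seed)
    return [[rng.choice((-1, 1)) for _ in range(num_neurons)]
            for _ in range(num_patterns)]

def step(state, patterns, degree=1):
    """One recall step: move from pattern mu towards pattern mu + 1.

    Each stored pattern's normalized overlap with the current state is
    raised to `degree` (use an odd integer to preserve its sign) and
    used to weight that pattern's successor.
    """
    n = len(state)
    overlaps = [sum(p_i * s_i for p_i, s_i in zip(p, state)) / n
                for p in patterns[:-1]]
    weights = [m ** degree for m in overlaps]
    successors = patterns[1:]
    return [sign(sum(w * q[i] for w, q in zip(weights, successors)))
            for i in range(n)]
```

    Starting from the first pattern and applying `step` repeatedly walks through the stored sequence; a higher odd `degree` suppresses crosstalk from patterns that only partially overlap with the current state.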
    Differentially Private Topological Data Analysis. (arXiv:2305.03609v2 [stat.ML] UPDATED)
    This paper is the first to attempt differentially private (DP) topological data analysis (TDA), producing near-optimal private persistence diagrams. We analyze the sensitivity of persistence diagrams in terms of the bottleneck distance, and we show that the commonly used Čech complex has sensitivity that does not decrease as the sample size $n$ increases. This makes it challenging for the persistence diagrams of Čech complexes to be privatized. As an alternative, we show that the persistence diagram obtained by the $L^1$-distance to measure (DTM) has sensitivity $O(1/n)$. Based on the sensitivity analysis, we propose using the exponential mechanism whose utility function is defined in terms of the bottleneck distance of the $L^1$-DTM persistence diagrams. We also derive upper and lower bounds of the accuracy of our privacy mechanism; the obtained bounds indicate that the privacy error of our mechanism is near-optimal. We demonstrate the performance of our privatized persistence diagrams through simulations as well as on a real dataset tracking human movement.
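    The role of sensitivity can be seen in a generic sketch of the exponential mechanism, which selects a candidate with probability proportional to `exp(epsilon * utility / (2 * sensitivity))`. In the paper's setting the utility would be defined through the bottleneck distance between persistence diagrams; here the candidates and utilities are plain placeholder numbers, and all function names are assumptions for illustration.

```python
import math
import random

def exponential_mechanism_probs(utilities, epsilon, sensitivity):
    """Selection probabilities proportional to exp(eps * u / (2 * sensitivity))."""
    scale = epsilon / (2.0 * sensitivity)
    top = max(utilities)  # subtract the max for numerical stability
    weights = [math.exp(scale * (u - top)) for u in utilities]
    total = sum(weights)
    return [w / total for w in weights]

def sample_candidate(candidates, utilities, epsilon, sensitivity, rng=random):
    """Draw one candidate according to the exponential mechanism."""
    probs = exponential_mechanism_probs(utilities, epsilon, sensitivity)
    return rng.choices(candidates, weights=probs, k=1)[0]
```

    With a fixed privacy budget `epsilon`, a smaller sensitivity (such as the $O(1/n)$ rate stated above) concentrates the output distribution on high-utility candidates.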
    Bucks for Buckets (B4B): Active Defenses Against Stealing Encoders. (arXiv:2310.08571v2 [cs.LG] UPDATED)
    Machine Learning as a Service (MLaaS) APIs provide ready-to-use and high-utility encoders that generate vector representations for given inputs. Since these encoders are very costly to train, they become lucrative targets for model stealing attacks during which an adversary leverages query access to the API to replicate the encoder locally at a fraction of the original training costs. We propose Bucks for Buckets (B4B), the first active defense that prevents stealing while the attack is happening without degrading representation quality for legitimate API users. Our defense relies on the observation that the representations returned to adversaries who try to steal the encoder's functionality cover a significantly larger fraction of the embedding space than representations of legitimate users who utilize the encoder to solve a particular downstream task. B4B leverages this to adaptively adjust the utility of the returned representations according to a user's coverage of the embedding space. To prevent adaptive adversaries from eluding our defense by simply creating multiple user accounts (sybils), B4B also individually transforms each user's representations. This prevents the adversary from directly aggregating representations over multiple accounts to create their stolen encoder copy. Our active defense opens a new path towards securely sharing and democratizing encoders over public APIs.
    Detection of keratoconus Diseases using deep Learning. (arXiv:2311.01996v1 [eess.IV])
    One of the most serious corneal disorders, keratoconus is difficult to diagnose in its early stages and can result in blindness. This illness, which often appears in the second decade of life, affects people of all sexes and races. Convolutional neural networks (CNNs), one of the deep learning approaches, have recently come to light as particularly promising tools for the accurate and timely diagnosis of keratoconus. The purpose of this study was to evaluate how well different D-CNN models identified keratoconus-related diseases. To be more precise, we compared five different CNN-based deep learning architectures (DenseNet201, InceptionV3, MobileNetV2, VGG19, Xception). In our comprehensive experimental analysis, the DenseNet201-based model performed very well in keratoconus disease identification. This model outperformed its D-CNN equivalents, with an accuracy rate of 89.14% in three crucial classes: Keratoconus, Normal, and Suspect. The results demonstrate not only the stability and robustness of the model but also its practical usefulness in real-world applications for accurate and dependable keratoconus identification. Beyond accuracy, D-CNN DenseNet201 also performs well in terms of precision, recall, and F1 score. These measures validate the model's usefulness as an effective diagnostic tool by highlighting its capacity to reliably detect instances of keratoconus and to reduce false positives and negatives.
    Reproducible Parameter Inference Using Bagged Posteriors. (arXiv:2311.02019v1 [stat.ME])
    Under model misspecification, it is known that Bayesian posteriors often do not properly quantify uncertainty about true or pseudo-true parameters. Even more fundamentally, misspecification leads to a lack of reproducibility in the sense that the same model will yield contradictory posteriors on independent data sets from the true distribution. To define a criterion for reproducible uncertainty quantification under misspecification, we consider the probability that two confidence sets constructed from independent data sets have nonempty overlap, and we establish a lower bound on this overlap probability that holds for any valid confidence sets. We prove that credible sets from the standard posterior can strongly violate this bound, particularly in high-dimensional settings (i.e., with dimension increasing with sample size), indicating that it is not internally coherent under misspecification. To improve reproducibility in an easy-to-use and widely applicable way, we propose to apply bagging to the Bayesian posterior ("BayesBag"); that is, to use the average of posterior distributions conditioned on bootstrapped datasets. We motivate BayesBag from first principles based on Jeffrey conditionalization and show that the bagged posterior typically satisfies the overlap lower bound. Further, we prove a Bernstein--Von Mises theorem for the bagged posterior, establishing its asymptotic normal distribution. We demonstrate the benefits of BayesBag via simulation experiments and an application to crime rate prediction.
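    The abstract above describes the bagged posterior as the average of posteriors conditioned on bootstrapped datasets. A minimal sketch of that averaging, assuming a toy conjugate model (normal mean with known noise variance and a zero-mean normal prior); the function names, the prior, and the number of resamples are illustrative assumptions, not taken from the paper:

```python
import random
import statistics


def normal_posterior(data, prior_var=10.0, noise_var=1.0):
    """Posterior (mean, variance) of a normal mean under a N(0, prior_var) prior."""
    precision = len(data) / noise_var + 1.0 / prior_var
    mean = (sum(data) / noise_var) / precision
    return mean, 1.0 / precision


def bagged_posterior(data, num_bootstrap=200, seed=0):
    """Mean and variance of the equal-weight mixture of bootstrap posteriors."""
    rng = random.Random(seed)
    means, variances = [], []
    for _ in range(num_bootstrap):
        # Resample the data with replacement and condition on the resample.
        resample = [rng.choice(data) for _ in data]
        m, v = normal_posterior(resample)
        means.append(m)
        variances.append(v)
    mix_mean = statistics.fmean(means)
    # Law of total variance: average component variance plus spread of the means.
    mix_var = statistics.fmean(variances) + statistics.pvariance(means)
    return mix_mean, mix_var
```

    In this toy setting the bagged variance is never smaller than the standard posterior variance, because the spread of the bootstrap means is added on top of the usual posterior variance.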
    GradTree: Learning Axis-Aligned Decision Trees with Gradient Descent. (arXiv:2305.03515v4 [cs.LG] UPDATED)
    Decision Trees (DTs) are commonly used for many machine learning tasks due to their high degree of interpretability. However, learning a DT from data is a difficult optimization problem, as it is non-convex and non-differentiable. Therefore, common approaches learn DTs using a greedy growth algorithm that minimizes the impurity locally at each internal node. Unfortunately, this greedy procedure can lead to inaccurate trees. In this paper, we present a novel approach for learning hard, axis-aligned DTs with gradient descent. The proposed method uses backpropagation with a straight-through operator on a dense DT representation, to jointly optimize all tree parameters. Our approach outperforms existing methods on binary classification benchmarks and achieves competitive results for multi-class tasks. The method is available under: https://github.com/s-marton/GradTree
    Quantum circuit synthesis with diffusion models. (arXiv:2311.02041v1 [quant-ph])
    Quantum computing has recently emerged as a transformative technology. Yet, its promised advantages rely on efficiently translating quantum operations into viable physical realizations. In this work, we use generative machine learning models, specifically denoising diffusion models (DMs), to facilitate this transformation. Leveraging text-conditioning, we steer the model to produce desired quantum operations within gate-based quantum circuits. Notably, DMs make it possible to sidestep, during training, the exponential overhead inherent in the classical simulation of quantum dynamics -- a consistent bottleneck in preceding ML techniques. We demonstrate the model's capabilities across two tasks: entanglement generation and unitary compilation. The model excels at generating new circuits and supports typical DM extensions such as masking and editing to, for instance, align the circuit generation to the constraints of the targeted quantum device. Given their flexibility and generalization abilities, we envision DMs as pivotal in quantum circuit synthesis, enhancing both practical applications and insights into theoretical quantum computation.
    Towards Abstractive Timeline Summarisation using Preference-based Reinforcement Learning. (arXiv:2211.07596v2 [cs.LG] UPDATED)
    This paper introduces a novel pipeline for summarising timelines of events reported by multiple news sources. Transformer-based models for abstractive summarisation generate coherent and concise summaries of long documents but can fail to outperform established extractive methods on specialised tasks such as timeline summarisation (TLS). While extractive summaries are more faithful to their sources, they may be less readable and contain redundant or unnecessary information. This paper proposes a preference-based reinforcement learning (PBRL) method for adapting pretrained abstractive summarisers to TLS, which can overcome the drawbacks of extractive timeline summaries. We define a compound reward function that learns from keywords of interest and pairwise preference labels, which we use to fine-tune a pretrained abstractive summariser via offline reinforcement learning. We carry out both automated and human evaluation on three datasets, finding that our method outperforms a comparable extractive TLS method on two of the three benchmark datasets, and participants prefer our method's summaries to those of both the extractive TLS method and the pretrained abstractive model. The method does not require expensive reference summaries and needs only a small number of preferences to align the generated summaries with human preferences.
    Remember what you did so you know what to do next. (arXiv:2311.01468v1 [cs.CL])
    We explore using a moderately sized large language model (GPT-J 6B parameters) to create a plan for a simulated robot to achieve 30 classes of goals in ScienceWorld, a text game simulator for elementary science experiments. Previously published empirical work claimed that large language models (LLMs) are a poor fit (Wang et al., 2022) compared to reinforcement learning. Using the Markov assumption (a single previous step), the LLM outperforms the reinforcement learning-based approach by a factor of 1.4. When we fill the LLM's input buffer with as many prior steps as possible, improvement rises to 3.5x. Even when training on only 6.5% of the training data, we observe a 2.2x improvement over the reinforcement-learning-based approach. Our experiments show that performance varies widely across the 30 classes of actions, indicating that averaging over tasks can hide significant performance issues. In work contemporaneous with ours, Lin et al. (2023) demonstrated a two-part approach (SwiftSage) that uses a small LLM (T5-large) complemented by OpenAI's massive LLMs to achieve outstanding results in ScienceWorld. Our 6B-parameter, single-stage GPT-J matches the performance of SwiftSage's two-stage architecture when it incorporates GPT-3.5 turbo, which has 29 times more parameters than GPT-J.
    RiskQ: Risk-sensitive Multi-Agent Reinforcement Learning Value Factorization. (arXiv:2311.01753v1 [cs.MA])
    Multi-agent systems are characterized by environmental uncertainty, varying policies of agents, and partial observability, which result in significant risks. In the context of Multi-Agent Reinforcement Learning (MARL), learning coordinated and decentralized policies that are sensitive to risk is challenging. To formulate the coordination requirements in risk-sensitive MARL, we introduce the Risk-sensitive Individual-Global-Max (RIGM) principle as a generalization of the Individual-Global-Max (IGM) and Distributional IGM (DIGM) principles. This principle requires that the collection of risk-sensitive action selections of each agent should be equivalent to the risk-sensitive action selection of the central policy. Current MARL value factorization methods do not satisfy the RIGM principle for common risk metrics such as the Value at Risk (VaR) metric or distorted risk measurements. Therefore, we propose RiskQ to address this limitation, which models the joint return distribution by modeling quantiles of it as weighted quantile mixtures of per-agent return distribution utilities. RiskQ satisfies the RIGM principle for the VaR and distorted risk metrics. We show that RiskQ can obtain promising performance through extensive experiments. The source code of RiskQ is available in https://github.com/xmu-rl-3dv/RiskQ.
    Active Learning-Based Species Range Estimation. (arXiv:2311.02061v1 [cs.LG])
    We propose a new active learning approach for efficiently estimating the geographic range of a species from a limited number of on-the-ground observations. We model the range of an unmapped species of interest as the weighted combination of estimated ranges obtained from a set of different species. We show that it is possible to generate this candidate set of ranges by using models that have been trained on large, weakly supervised, community-collected observation data. From this, we develop a new active querying approach that sequentially selects geographic locations to visit that best reduce our uncertainty over an unmapped species' range. We conduct a detailed evaluation of our approach and compare it to existing active learning methods using an evaluation dataset containing expert-derived ranges for one thousand species. Our results demonstrate that our method outperforms alternative active learning methods and approaches the performance of end-to-end trained models, even when only using a fraction of the data. This highlights the utility of active learning via transfer learned spatial representations for species range estimation. It also emphasizes the value of leveraging emerging large-scale crowdsourced datasets, not only for modeling a species' range, but also for actively discovering them.
    Feature-Attending Recurrent Modules for Generalization in Reinforcement Learning. (arXiv:2112.08369v3 [cs.LG] UPDATED)
    Many important tasks are defined in terms of objects. To generalize across these tasks, a reinforcement learning (RL) agent needs to exploit the structure that the objects induce. Prior work has either hard-coded object-centric features, used complex object-centric generative models, or updated state using local spatial features. However, these approaches have had limited success in enabling general RL agents. Motivated by this, we introduce "Feature-Attending Recurrent Modules" (FARM), an architecture for learning state representations that relies on simple, broadly applicable inductive biases for capturing spatial and temporal regularities. FARM learns a state representation that is distributed across multiple modules that each attend to spatiotemporal features with an expressive feature attention mechanism. We show that this improves an RL agent's ability to generalize across object-centric tasks. We study task suites in both 2D and 3D environments and find that FARM better generalizes compared to competing architectures that leverage attention or multiple modules.
    Detecting Out-of-Distribution Through the Lens of Neural Collapse. (arXiv:2311.01479v1 [cs.LG])
    Out-of-distribution (OOD) detection is essential for the safe deployment of AI. Particularly, OOD detectors should generalize effectively across diverse scenarios. To improve upon the generalizability of existing OOD detectors, we introduce a highly versatile OOD detector, called Neural Collapse inspired OOD detector (NC-OOD). We extend the prevalent observation that in-distribution (ID) features tend to form clusters, whereas OOD features are far away. Particularly, based on the recent observation, Neural Collapse, we further demonstrate that ID features tend to cluster in proximity to weight vectors. From our extended observation, we propose to detect OOD based on feature proximity to weight vectors. To further rule out OOD samples, we leverage the observation that OOD features tend to reside closer to the origin than ID features. Extensive experiments show that our approach enhances the generalizability of existing work and can consistently achieve state-of-the-art OOD detection performance across a wide range of OOD Benchmarks over different classification tasks, training losses, and model architectures.
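    The abstract above names two signals for OOD detection: proximity of a feature to the classifier's weight vectors, and the feature's distance from the origin. A rough sketch of how such a score could be combined, under stated assumptions: cosine similarity as the proximity measure, an additive norm term, and the coefficient `norm_coef` are all hypothetical choices for illustration and are not the paper's scoring function.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def id_score(feature, class_weights, norm_coef=0.1):
    """Higher score = more likely in-distribution (hypothetical combination rule).

    Assumes `feature` is non-zero; a zero vector would divide by zero in `cosine`.
    """
    # Proximity: alignment with the closest class weight vector.
    proximity = max(cosine(feature, w) for w in class_weights)
    # Norm term: features near the origin receive a lower score.
    norm = math.sqrt(sum(a * a for a in feature))
    return proximity + norm_coef * norm
```

    A sample would then be flagged as OOD when its score falls below a threshold chosen on held-out in-distribution data.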
    FedSN: A General Federated Learning Framework over LEO Satellite Networks. (arXiv:2311.01483v1 [cs.LG])
    Recently, a large number of Low Earth Orbit (LEO) satellites have been launched and deployed successfully in space by commercial companies, such as SpaceX. Because LEO satellites are equipped with multimodal sensors, they serve not only for communication but also for various machine learning applications, such as space modulation recognition, remote sensing image classification, etc. However, the ground station (GS) may be incapable of downloading such a large volume of raw sensing data for centralized model training due to the limited contact time with LEO satellites (e.g. 5 minutes). Therefore, federated learning (FL) has emerged as the promising solution to address this problem via on-device training. Unfortunately, to enable FL on LEO satellites, we still face three critical challenges that are i) heterogeneous computing and memory capabilities, ii) limited uplink rate, and iii) model staleness. To this end, we propose FedSN as a general FL framework to tackle the above challenges, and fully explore data diversity on LEO satellites. Specifically, we first present a novel sub-structure scheme to enable heterogeneous local model training considering different computing, memory, and communication constraints on LEO satellites. Additionally, we propose a pseudo-synchronous model aggregation strategy to dynamically schedule model aggregation for compensating model staleness. To further demonstrate the effectiveness of FedSN, we evaluate it using space modulation recognition and remote sensing image classification tasks by leveraging the data from real-world satellite networks. Extensive experimental results demonstrate that the FedSN framework achieves higher accuracy and lower computing and communication overhead than the state-of-the-art benchmarks, and confirm the effectiveness of each component in FedSN.
    An Ensemble Machine Learning Approach for Screening Covid-19 based on Urine Parameters. (arXiv:2311.01854v1 [eess.IV])
    The rapid spread of COVID-19 and the emergence of new variants underscore the importance of effective screening measures. Rapid diagnosis and subsequent quarantine of infected individuals can prevent further spread of the virus in society. While PCR tests are the gold standard for COVID-19 diagnosis, they are costly and time-consuming. In contrast, urine test strips are an inexpensive, non-invasive, and rapidly obtainable screening method that can provide important information about a patient's health status. In this study, we collected a new dataset and used the RGB (Red Green Blue) color space of urine test strips parameters to detect the health status of individuals. To improve the accuracy of our model, we converted the RGB space to 10 additional color spaces. After evaluating four different machine learning models, we proposed a new ensemble model based on a multi-layer perceptron neural network. Although the initial results were not strong, we were able to improve the model's screening performance for COVID-19 by removing uncertain regions of the model space. Ultimately, our model achieved a screening accuracy of 80% based on urine parameters. Our results suggest that urine test strips can be a useful tool for COVID-19 screening, particularly in resource-constrained settings where PCR testing may not be feasible. Further research is needed to validate our findings and explore the potential role of urine test strips in COVID-19 diagnosis and management.
    LOTUS: Continual Imitation Learning for Robot Manipulation Through Unsupervised Skill Discovery. (arXiv:2311.02058v1 [cs.RO])
    We introduce LOTUS, a continual imitation learning algorithm that empowers a physical robot to continuously and efficiently learn to solve new manipulation tasks throughout its lifespan. The core idea behind LOTUS is constructing an ever-growing skill library from a sequence of new tasks with a small number of human demonstrations. LOTUS starts with a continual skill discovery process using an open-vocabulary vision model, which extracts skills as recurring patterns presented in unsegmented demonstrations. Continual skill discovery updates existing skills to avoid catastrophic forgetting of previous tasks and adds new skills to solve novel tasks. LOTUS trains a meta-controller that flexibly composes various skills to tackle vision-based manipulation tasks in the lifelong learning process. Our comprehensive experiments show that LOTUS outperforms state-of-the-art baselines by over 11% in success rate, showing its superior knowledge transfer ability compared to prior methods. More results and videos can be found on the project website: https://ut-austin-rpl.github.io/Lotus/.
    Doubly Robust Self-Training. (arXiv:2306.00265v3 [cs.LG] UPDATED)
    Self-training is an important technique for solving semi-supervised learning problems. It leverages unlabeled data by generating pseudo-labels and combining them with a limited labeled dataset for training. The effectiveness of self-training heavily relies on the accuracy of these pseudo-labels. In this paper, we introduce doubly robust self-training, a novel semi-supervised algorithm that provably balances between two extremes. When the pseudo-labels are entirely incorrect, our method reduces to a training process solely using labeled data. Conversely, when the pseudo-labels are completely accurate, our method transforms into a training process utilizing all pseudo-labeled data and labeled data, thus increasing the effective sample size. Through empirical evaluations on both the ImageNet dataset for image classification and the nuScenes autonomous driving dataset for 3D object detection, we demonstrate the superiority of the doubly robust loss over the standard self-training baseline.
    Global Optimization: A Machine Learning Approach. (arXiv:2311.01742v1 [math.OC])
    Many approaches for addressing Global Optimization problems typically rely on relaxations of nonlinear constraints over specific mathematical primitives. This is restrictive in applications with constraints that are black-box, implicit or consist of more general primitives. Trying to address such limitations, Bertsimas and Ozturk (2023) proposed OCTHaGOn as a way of solving black-box global optimization problems by approximating the nonlinear constraints using hyperplane-based Decision-Trees and then using those trees to construct a unified mixed integer optimization (MIO) approximation of the original problem. We provide extensions to this approach, by (i) approximating the original problem using other MIO-representable ML models besides Decision Trees, such as Gradient Boosted Trees, Multi Layer Perceptrons and Support Vector Machines, (ii) proposing adaptive sampling procedures for more accurate machine learning-based constraint approximations, (iii) utilizing robust optimization to account for the uncertainty of the sample-dependent training of the ML models, and (iv) leveraging a family of relaxations to address the infeasibilities of the final MIO approximation. We then test the enhanced framework in 81 Global Optimization instances. We show improvements in solution feasibility and optimality in the majority of instances. We also compare against BARON, showing improved optimality gaps or solution times in 11 instances.
    Spectral Clustering of Attributed Multi-relational Graphs. (arXiv:2311.01840v1 [cs.LG])
    Graph clustering aims at discovering a natural grouping of the nodes such that similar nodes are assigned to a common cluster. Many different algorithms have been proposed in the literature: for simple graphs, for graphs with attributes associated to nodes, and for graphs where edges represent different types of relations among nodes. However, complex data in many domains can be represented as both attributed and multi-relational networks. In this paper, we propose SpectralMix, a joint dimensionality reduction technique for multi-relational graphs with categorical node attributes. SpectralMix integrates all information available from the attributes, the different types of relations, and the graph structure to enable a sound interpretation of the clustering results. Moreover, it generalizes existing techniques: it reduces to spectral embedding and clustering when only applied to a single graph and to homogeneity analysis when applied to categorical data. Experiments conducted on several real-world datasets enable us to detect dependencies between graph structure and categorical attributes; moreover, they exhibit the superiority of SpectralMix over existing methods.
    TinyFormer: Efficient Transformer Design and Deployment on Tiny Devices. (arXiv:2311.01759v1 [cs.LG])
    Developing deep learning models on tiny devices (e.g. Microcontroller units, MCUs) has attracted much attention in various embedded IoT applications. However, it is challenging to efficiently design and deploy recent advanced models (e.g. transformers) on tiny devices due to their severe hardware resource constraints. In this work, we propose TinyFormer, a framework specifically designed to develop and deploy resource-efficient transformers on MCUs. TinyFormer mainly consists of SuperNAS, SparseNAS and SparseEngine. Separately, SuperNAS aims to search for an appropriate supernet from a vast search space. SparseNAS evaluates the best sparse single-path model including transformer architecture from the identified supernet. Finally, SparseEngine efficiently deploys the searched sparse models onto MCUs. To the best of our knowledge, SparseEngine is the first deployment framework capable of performing inference of sparse models with transformer on MCUs. Evaluation results on the CIFAR-10 dataset demonstrate that TinyFormer can develop efficient transformers with an accuracy of $96.1\%$ while adhering to hardware constraints of $1$MB storage and $320$KB memory. Additionally, TinyFormer achieves significant speedups in sparse inference, up to $12.2\times$, when compared to the CMSIS-NN library. TinyFormer is believed to bring powerful transformers into TinyML scenarios and greatly expand the scope of deep learning applications.
    Minimax Quasi-Bayesian estimation in sparse canonical correlation analysis via a Rayleigh quotient function. (arXiv:2010.08627v3 [stat.ML] UPDATED)
    Canonical correlation analysis (CCA) is a popular statistical technique for exploring relationships between datasets. In recent years, the estimation of sparse canonical vectors has emerged as an important but challenging variant of the CCA problem, with widespread applications. Unfortunately, existing rate-optimal estimators for sparse canonical vectors have high computational cost. We propose a quasi-Bayesian estimation procedure that not only achieves the minimax estimation rate, but also is easy to compute by Markov Chain Monte Carlo (MCMC). The method builds on Tan et al. (2018) and uses a re-scaled Rayleigh quotient function as the quasi-log-likelihood. However, unlike Tan et al. (2018), we adopt a Bayesian framework that combines this quasi-log-likelihood with a spike-and-slab prior to regularize the inference and promote sparsity. We investigate the empirical behavior of the proposed method on both continuous and truncated data, and we demonstrate that it outperforms several state-of-the-art methods. As an application, we use the proposed methodology to maximally correlate clinical variables and proteomic data for better understanding the Covid-19 disease.
    Why think step by step? Reasoning emerges from the locality of experience. (arXiv:2304.03843v3 [cs.AI] UPDATED)
    Humans have a powerful and mysterious capacity to reason. Working through a set of mental steps enables us to make inferences we would not be capable of making directly even though we get no additional data from the world. Similarly, when large language models generate intermediate steps (a chain of thought) before answering a question, they often produce better answers than they would directly. We investigate why and how chain-of-thought reasoning is useful in language models, testing the hypothesis that reasoning is effective when training data consists of overlapping local clusters of variables that influence each other strongly. These training conditions enable the chaining of accurate local inferences to estimate relationships between variables that were not seen together in training. We prove that there will exist a "reasoning gap", where reasoning through intermediate variables reduces bias, for the simple case of an autoregressive density estimator trained on local samples from a chain-structured probabilistic model. We then test our hypothesis experimentally in more complex models, training an autoregressive language model on samples from Bayes nets but only including a subset of variables in each sample. We test language models' ability to match conditional probabilities with and without intermediate reasoning steps, finding that intermediate steps are only helpful when the training data is locally structured with respect to dependencies between variables. The combination of locally structured observations and reasoning is much more data-efficient than training on all variables. Our results illustrate how the effectiveness of reasoning step by step is rooted in the local statistical structure of the training data.
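    The chaining of local inferences described above can be illustrated on a three-variable chain A -> B -> C: even if A and C were never observed together, P(C | A) follows from the two local conditionals by summing over the intermediate variable B. A small sketch, assuming discrete variables and conditional tables stored as nested lists; the function name and table layout are illustrative, not from the paper:

```python
def chain_conditional(p_b_given_a, p_c_given_b, a):
    """Return P(C | A=a) for a chain A -> B -> C by marginalizing over B.

    p_b_given_a[a][b] = P(B=b | A=a); p_c_given_b[b][c] = P(C=c | B=b).
    """
    num_b = len(p_c_given_b)
    num_c = len(p_c_given_b[0])
    return [
        sum(p_b_given_a[a][b] * p_c_given_b[b][c] for b in range(num_b))
        for c in range(num_c)
    ]
```

    For example, with P(B | A=0) = [0.9, 0.1] and rows P(C | B=0) = [0.7, 0.3], P(C | B=1) = [0.1, 0.9], the chained estimate is P(C | A=0) = [0.64, 0.36].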
    Enhancing Functional Data Analysis with Sequential Neural Networks: Advantages and Comparative Study. (arXiv:2311.01875v1 [cs.LG])
    Functional Data Analysis (FDA) is a statistical domain developed to handle functional data characterized by high dimensionality and complex data structures. Sequential Neural Networks (SNNs) are specialized neural networks capable of processing sequence data, a fundamental aspect of functional data. Despite their great flexibility in modeling functional data, SNNs have been inadequately employed in the FDA community. One notable advantage of SNNs is the ease of implementation, making them accessible to a broad audience beyond academia. Conversely, FDA-based methodologies present challenges, particularly for practitioners outside the field, due to their intricate complexity. In light of this, we propose utilizing SNNs in FDA applications and demonstrate their effectiveness through comparative analyses against popular FDA regression models based on numerical experiments and real-world data analysis. SNN architectures allow us to surpass the limitations of traditional FDA methods, offering scalability, flexibility, and improved analytical performance. Our findings highlight the potential of SNN-based methodologies as powerful tools for data applications involving functional data.
    High Precision Causal Model Evaluation with Conditional Randomization. (arXiv:2311.01902v1 [cs.LG])
    The gold standard for causal model evaluation involves comparing model predictions with true effects estimated from randomized controlled trials (RCT). However, RCTs are not always feasible or ethical to perform. In contrast, conditionally randomized experiments based on inverse probability weighting (IPW) offer a more realistic approach but may suffer from high estimation variance. To tackle this challenge and enhance causal model evaluation in real-world conditional randomization settings, we introduce a novel low-variance estimator for causal error, dubbed the pairs estimator. By applying the same IPW estimator to both the model and true experimental effects, our estimator effectively cancels out the variance due to IPW and achieves a smaller asymptotic variance. Empirical studies demonstrate the improvement of our estimator, highlighting its potential for achieving near-RCT performance. Our method offers a simple yet powerful solution to evaluate causal inference models in conditional randomization settings without complicated modification of the IPW estimator itself, paving the way for more robust and reliable model assessments.
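    For readers unfamiliar with inverse probability weighting, the following is a minimal sketch of the standard IPW (Horvitz-Thompson style) estimate of an average treatment effect from a conditionally randomized experiment. It shows only the baseline IPW estimator that the abstract builds on, not the pairs estimator itself; the function name and argument layout are assumptions for illustration.

```python
def ipw_ate(treated, outcomes, propensities):
    """Standard IPW estimate of the average treatment effect.

    treated[i] is 1 or 0, outcomes[i] is the observed outcome, and
    propensities[i] = P(treated | covariates) must lie strictly in (0, 1).
    """
    total = 0.0
    for t, y, e in zip(treated, outcomes, propensities):
        # Treated units are weighted by 1/e, control units by 1/(1 - e).
        total += t * y / e - (1 - t) * y / (1 - e)
    return total / len(outcomes)
```

    With treatments [1, 0, 1, 0], outcomes [3.0, 1.0, 5.0, 2.0], and all propensities equal to 0.5, the estimate is (6 - 2 + 10 - 4) / 4 = 2.5. Small propensities inflate the weights, which is the source of the high variance the abstract refers to.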
    Conditions on Preference Relations that Guarantee the Existence of Optimal Policies. (arXiv:2311.01990v1 [cs.LG])
    Learning from Preferential Feedback (LfPF) plays an essential role in training Large Language Models, as well as certain types of interactive learning agents. However, a substantial gap exists between the theory and application of LfPF algorithms. Current results guaranteeing the existence of optimal policies in LfPF problems assume that both the preferences and transition dynamics are determined by a Markov Decision Process. We introduce the Direct Preference Process, a new framework for analyzing LfPF problems in partially-observable, non-Markovian environments. Within this framework, we establish conditions that guarantee the existence of optimal policies by considering the ordinal structure of the preferences. Using the von Neumann-Morgenstern Expected Utility Theorem, we show that the Direct Preference Process generalizes the standard reinforcement learning problem. Our findings narrow the gap between the empirical success and theoretical understanding of LfPF algorithms and provide future practitioners with the tools necessary for a more principled design of LfPF agents.
    CheX-Nomaly: Segmenting Lung Abnormalities from Chest Radiographs using Machine Learning. (arXiv:2311.01777v1 [eess.IV])
    The global challenge in chest radiograph X-ray (CXR) abnormalities often being misdiagnosed is primarily associated with perceptual errors, where healthcare providers struggle to accurately identify the location of abnormalities, rather than misclassification errors. We currently address this problem through disease-specific segmentation models. Unfortunately, these models cannot be released in the field due to their lack of generalizability across all thoracic diseases. A binary model tends to perform poorly when it encounters a disease that isn't represented in the dataset. We present CheX-nomaly: a binary localization U-net model that leverages transfer learning techniques with the incorporation of an innovative contrastive learning approach. Trained on the VinDr-CXR dataset, which encompasses 14 distinct diseases in addition to 'no finding' cases, our model achieves generalizability across these 14 diseases and others it has not seen before. We show that we can significantly improve the generalizability of an abnormality localization model by incorporating a contrastive learning method and dissociating the bounding boxes from their disease classes. We also introduce a new loss technique to enhance the U-net's performance on bounding box segmentation. By introducing CheX-nomaly, we offer a promising solution to enhance the precision of chest disease diagnosis, with a specific focus on reducing the significant number of perceptual errors in healthcare.
    ForecastPFN: Synthetically-Trained Zero-Shot Forecasting. (arXiv:2311.01933v1 [cs.LG])
    The vast majority of time-series forecasting approaches require a substantial training dataset. However, many real-life forecasting applications have very few initial observations, sometimes just 40 or fewer. Thus, the applicability of most forecasting methods is restricted in data-sparse commercial applications. While there is recent work in the setting of very limited initial data (so-called 'zero-shot' forecasting), its performance is inconsistent depending on the data used for pretraining. In this work, we take a different approach and devise ForecastPFN, the first zero-shot forecasting model trained purely on a novel synthetic data distribution. ForecastPFN is a prior-data fitted network, trained to approximate Bayesian inference, which can make predictions on a new time series dataset in a single forward pass. Through extensive experiments, we show that zero-shot predictions made by ForecastPFN are more accurate and faster compared to state-of-the-art forecasting methods, even when the other methods are allowed to train on hundreds of additional in-distribution data points.
    Obtaining Explainable Classification Models using Distributionally Robust Optimization. (arXiv:2311.01994v1 [stat.ML])
    Model explainability is crucial for human users to be able to interpret how a proposed classifier assigns labels to data based on its feature values. We study generalized linear models constructed using sets of feature value rules, which can capture nonlinear dependencies and interactions. An inherent trade-off exists between rule set sparsity and its prediction accuracy. It is computationally expensive to find the right choice of sparsity -- e.g., via cross-validation -- with existing methods. We propose a new formulation to learn an ensemble of rule sets that simultaneously addresses these competing factors. Good generalization is ensured while keeping computational costs low by utilizing distributionally robust optimization. The formulation utilizes column generation to efficiently search the space of rule sets and constructs a sparse ensemble of rule sets, in contrast with techniques like random forests or boosting and their variants. We present theoretical results that motivate and justify the use of our distributionally robust formulation. Extensive numerical experiments establish that our method improves over competing methods -- on a large set of publicly available binary classification problem instances -- with respect to one or more of the following metrics: generalization quality, computational cost, and explainability.
    Recurrent Neural-Linear Posterior Sampling for Nonstationary Contextual Bandits. (arXiv:2007.04750v2 [cs.LG] UPDATED)
    An agent in a nonstationary contextual bandit problem should balance between exploration and the exploitation of (periodic or structured) patterns present in its previous experiences. Handcrafting an appropriate historical context is an attractive alternative to transform a nonstationary problem into a stationary problem that can be solved efficiently. However, even a carefully designed historical context may introduce spurious relationships or lack a convenient representation of crucial information. In order to address these issues, we propose an approach that learns to represent the relevant context for a decision based solely on the raw history of interactions between the agent and the environment. This approach relies on a combination of features extracted by recurrent neural networks with a contextual linear bandit algorithm based on posterior sampling. Our experiments on a diverse selection of contextual and noncontextual nonstationary problems show that our recurrent approach consistently outperforms its feedforward counterpart, which requires handcrafted historical contexts, while being more widely applicable than conventional nonstationary bandit algorithms. Although it is very difficult to provide theoretical performance guarantees for our new approach, we also prove a novel regret bound for linear posterior sampling with measurement error that may serve as a foundation for future theoretical work.
    Characterizing Graph Datasets for Node Classification: Homophily-Heterophily Dichotomy and Beyond. (arXiv:2209.06177v3 [cs.SI] UPDATED)
    Homophily is a graph property describing the tendency of edges to connect similar nodes; the opposite is called heterophily. It is often believed that heterophilous graphs are challenging for standard message-passing graph neural networks (GNNs), and much effort has been put into developing efficient methods for this setting. However, there is no universally agreed-upon measure of homophily in the literature. In this work, we show that commonly used homophily measures have critical drawbacks preventing the comparison of homophily levels across different datasets. For this, we formalize desirable properties for a proper homophily measure and verify which measures satisfy which properties. In particular, we show that a measure that we call adjusted homophily satisfies more desirable properties than other popular homophily measures while being rarely used in graph machine learning literature. Then, we go beyond the homophily-heterophily dichotomy and propose a new characteristic that allows one to further distinguish different sorts of heterophily. The proposed label informativeness (LI) characterizes how much information a neighbor's label provides about a node's label. We prove that this measure satisfies important desirable properties. We also observe empirically that LI better agrees with GNN performance compared to homophily measures, which confirms that it is a useful characteristic of the graph structure.
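    The two measures named in the abstract above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names are hypothetical, and the adjusted form used here (edge homophily minus its expected value under degree-weighted class proportions, normalised by one minus that value) is an assumption, since the abstract does not state the formula.

```python
from collections import Counter


def edge_homophily(edges, labels):
    """Fraction of edges whose two endpoints carry the same label."""
    same = sum(1 for u, v in edges if labels[u] == labels[v])
    return same / len(edges)


def adjusted_homophily(edges, labels):
    """Edge homophily corrected for chance agreement (assumed form).

    `expected` is the sum over classes of the squared share of edge
    endpoints that fall in that class.
    """
    endpoints_by_class = Counter()
    for u, v in edges:
        endpoints_by_class[labels[u]] += 1
        endpoints_by_class[labels[v]] += 1
    total = 2 * len(edges)
    expected = sum((d / total) ** 2 for d in endpoints_by_class.values())
    # Undefined (division by zero) when every endpoint is in one class.
    return (edge_homophily(edges, labels) - expected) / (1 - expected)


# Path graph 0-1-2-3 with two classes of two nodes each.
edges = [(0, 1), (1, 2), (2, 3)]
labels = [0, 0, 1, 1]
print(edge_homophily(edges, labels), adjusted_homophily(edges, labels))
```

    On this toy graph, two of three edges join same-label nodes, but half of that agreement is expected by chance, so the adjusted value is noticeably lower than the raw one.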
    Mix-ME: Quality-Diversity for Multi-Agent Learning. (arXiv:2311.01829v1 [cs.LG])
    In many real-world systems, such as adaptive robotics, achieving a single, optimised solution may be insufficient. Instead, a diverse set of high-performing solutions is often required to adapt to varying contexts and requirements. This is the realm of Quality-Diversity (QD), which aims to discover a collection of high-performing solutions, each with their own unique characteristics. QD methods have recently seen success in many domains, including robotics, where they have been used to discover damage-adaptive locomotion controllers. However, most existing work has focused on single-agent settings, despite many tasks of interest being multi-agent. To this end, we introduce Mix-ME, a novel multi-agent variant of the popular MAP-Elites algorithm that forms new solutions using a crossover-like operator by mixing together agents from different teams. We evaluate the proposed methods on a variety of partially observable continuous control tasks. Our evaluation shows that these multi-agent variants obtained by Mix-ME not only compete with single-agent baselines but also often outperform them in multi-agent settings under partial observability.
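    One reading of the crossover-like operator described above is sketched here. It is an assumption based only on the abstract: the function name `mix_teams`, the representation of a team as a list of per-agent policies, and the uniform per-slot choice are all hypothetical and may differ from what Mix-ME actually does.

```python
import random


def mix_teams(team_a, team_b, rng):
    """Build a child team by filling each agent slot from one of two parents."""
    if len(team_a) != len(team_b):
        raise ValueError("parent teams must have the same number of agents")
    # For every agent position, keep the policy from a randomly chosen parent.
    return [rng.choice((a, b)) for a, b in zip(team_a, team_b)]


# Policies are stand-in strings here; in practice they would be parameter sets.
parent_a = ["a0", "a1", "a2"]
parent_b = ["b0", "b1", "b2"]
child = mix_teams(parent_a, parent_b, random.Random(0))
print(child)
```

    In a MAP-Elites loop, the two parents would be drawn from different cells of the archive and the child evaluated before possible insertion.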
    High Probability Convergence of Adam Under Unbounded Gradients and Affine Variance Noise. (arXiv:2311.02000v1 [math.OC])
    In this paper, we study the convergence of the Adaptive Moment Estimation (Adam) algorithm for unconstrained non-convex smooth stochastic optimization. Despite its widespread use in machine learning, the theoretical understanding of Adam remains limited. Prior research primarily investigated Adam's convergence in expectation, often requiring strong assumptions such as uniformly bounded stochastic gradients or a priori problem-dependent knowledge. As a result, the applicability of these findings in practical real-world scenarios has been constrained. To overcome these limitations, we provide a deep analysis and show that Adam can converge to a stationary point with high probability at a rate of $\mathcal{O}\left({\rm poly}(\log T)/\sqrt{T}\right)$ under coordinate-wise "affine" variance noise, without requiring any bounded-gradient assumption or any a priori problem-dependent knowledge to tune hyper-parameters. Additionally, it is revealed that Adam confines its gradients' magnitudes within an order of $\mathcal{O}\left({\rm poly}(\log T)\right)$. Finally, we also investigate a simplified version of Adam without one of the corrective terms and obtain a convergence rate that is adaptive to the noise level.
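    For reference, the standard Adam update that the analysis above concerns can be written as a short function. This is a plain-Python sketch of the textbook algorithm with its usual default hyper-parameters; the function name `adam_step` is hypothetical, and the simplified variant mentioned in the abstract is not reproduced because the abstract does not say which corrective term it drops.

```python
import math


def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One coordinate-wise Adam update; t is the 1-based step counter."""
    new_theta, new_m, new_v = [], [], []
    for x, g, mi, vi in zip(theta, grad, m, v):
        mi = beta1 * mi + (1 - beta1) * g        # first-moment estimate
        vi = beta2 * vi + (1 - beta2) * g * g    # second-moment estimate
        m_hat = mi / (1 - beta1 ** t)            # bias correction
        v_hat = vi / (1 - beta2 ** t)
        new_theta.append(x - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(mi)
        new_v.append(vi)
    return new_theta, new_m, new_v


theta, m, v = [1.0, -2.0], [0.0, 0.0], [0.0, 0.0]
theta, m, v = adam_step(theta, [0.5, -4.0], m, v, t=1, lr=0.1)
print(theta)
```

    On the first step the bias-corrected ratio is close to the sign of the gradient, so each coordinate moves by roughly `lr` regardless of the gradient's magnitude.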
    A Variational Perspective on High-Resolution ODEs. (arXiv:2311.02002v1 [math.OC])
    We consider unconstrained minimization of smooth convex functions. We propose a novel variational perspective using a forced Euler-Lagrange equation that allows for studying high-resolution ODEs. Through this, we obtain a faster convergence rate for gradient norm minimization using Nesterov's accelerated gradient method. Additionally, we show that Nesterov's method can be interpreted as a rate-matching discretization of an appropriately chosen high-resolution ODE. Finally, using the results from the new variational perspective, we propose a stochastic method for noisy gradients. Several numerical experiments illustrate our stochastic algorithm and compare it with state-of-the-art methods.
    Heterogeneous federated collaborative filtering using FAIR: Federated Averaging in Random Subspaces. (arXiv:2311.01722v1 [cs.LG])
    Recommendation systems (RS) for items (e.g., movies, books) and ads are widely used to tailor content to users on various internet platforms. Traditionally, recommendation models are trained on a central server. However, due to rising concerns for data privacy and regulations like the GDPR, federated learning is an increasingly popular paradigm in which data never leaves the client device. Applying federated learning to recommendation models is non-trivial due to large embedding tables, which often exceed the memory constraints of most user devices. To include data from all devices in federated learning, we must enable collective training of embedding tables on devices with heterogeneous memory capacities. Current solutions to heterogeneous federated learning can only accommodate a small range of capacities and thus limit the number of devices that can participate in training. We present Federated Averaging in Random subspaces (FAIR), which allows arbitrary compression of embedding tables based on device capacity and ensures the participation of all devices in training. FAIR uses what we call consistent and collapsible subspaces defined by hashing-based random projections to jointly train large embedding tables while using varying amounts of compression on user devices. We evaluate FAIR on Neural Collaborative Filtering tasks with multiple datasets and verify that FAIR can gather and share information from a wide range of devices with varying capacities, allowing for seamless collaboration. We prove the convergence of FAIR in the homogeneous setting with non-i.i.d. data distribution. Our code is open source at https://github.com/apd10/FLCF.
    SEGA: Instructing Text-to-Image Models using Semantic Guidance. (arXiv:2301.12247v2 [cs.CV] UPDATED)
    Text-to-image diffusion models have recently received a lot of interest for their astonishing ability to produce high-fidelity images from text only. However, achieving one-shot generation that aligns with the user's intent is nearly impossible, yet small changes to the input prompt often result in very different images. This leaves the user with little semantic control. To put the user in control, we show how to interact with the diffusion process to flexibly steer it along semantic directions. This semantic guidance (SEGA) generalizes to any generative architecture using classifier-free guidance. More importantly, it allows for subtle and extensive edits, changes in composition and style, as well as optimizing the overall artistic conception. We demonstrate SEGA's effectiveness on both latent and pixel-based diffusion models such as Stable Diffusion, Paella, and DeepFloyd-IF using a variety of tasks, thus providing strong evidence for its versatility, flexibility, and improvements over existing methods.
    OpenAGI: When LLM Meets Domain Experts. (arXiv:2304.04370v6 [cs.AI] UPDATED)
    Human Intelligence (HI) excels at combining basic skills to solve complex tasks. This capability is vital for Artificial Intelligence (AI) and should be embedded in comprehensive AI Agents, enabling them to harness expert models for complex task-solving towards Artificial General Intelligence (AGI). Large Language Models (LLMs) show promising learning and reasoning abilities, and can effectively use external models, tools, plugins, or APIs to tackle complex problems. In this work, we introduce OpenAGI, an open-source AGI research and development platform designed for solving multi-step, real-world tasks. Specifically, OpenAGI uses a dual strategy, integrating standard benchmark tasks for benchmarking and evaluation, and open-ended tasks including more expandable models, tools, plugins, or APIs for creative problem-solving. Tasks are presented as natural language queries to the LLM, which then selects and executes appropriate models. We also propose a Reinforcement Learning from Task Feedback (RLTF) mechanism that uses task results to improve the LLM's task-solving ability, which creates a self-improving AI feedback loop. While we acknowledge that AGI is a broad and multifaceted research challenge with no singularly defined solution path, the integration of LLMs with domain-specific expert models, inspired by mirroring the blend of general and specialized intelligence in humans, offers a promising approach towards AGI. We are open-sourcing the OpenAGI project's code, dataset, benchmarks, evaluation methods, and the UI demo to foster community involvement in AGI advancement: https://github.com/agiresearch/OpenAGI.
    A minimax optimal control approach for robust neural ODEs. (arXiv:2310.17584v2 [math.OC] UPDATED)
    In this paper, we address the adversarial training of neural ODEs from a robust control perspective. This is an alternative to the classical training via empirical risk minimization, and it is widely used to enforce reliable outcomes for input perturbations. Neural ODEs allow the interpretation of deep neural networks as discretizations of control systems, unlocking powerful tools from control theory for the development and the understanding of machine learning. In this specific case, we formulate the adversarial training with perturbed data as a minimax optimal control problem, for which we derive first order optimality conditions in the form of Pontryagin's Maximum Principle. We provide a novel interpretation of robust training leading to an alternative weighted technique, which we test on a low-dimensional classification task.
    Applications of the Theory of Aggregated Markov Processes in Stochastic Learning Theory. (arXiv:2311.01476v1 [stat.ML])
    A stochastic process that arises by composing a function with a Markov process is called an aggregated Markov process (AMP). The purpose of composing a Markov process with a function can be a reduction of dimensions, e.g., a projection onto certain coordinates. The theory around AMP has been extensively studied e.g. by Dynkin, Cameron, Rogers and Pitman, and Kelly, all of whom provided sufficient conditions for an AMP to remain Markov. In another direction, Larget provided a canonical representation for AMP, which can be used to verify the equivalence of two AMPs. The purpose of this paper is to describe how the theory of AMP can be applied to stochastic learning theory as they learn a particular task.
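    A classical sufficient condition for an aggregated chain to remain Markov is strong lumpability in the sense of Kemeny and Snell: within each block of the partition induced by the function, every state must send the same total probability into each block. The check below is a minimal sketch of that condition for a finite chain. It is not necessarily the condition used in the paper, and the function name is hypothetical.

```python
def is_strongly_lumpable(P, blocks, tol=1e-12):
    """Check Kemeny-Snell strong lumpability of transition matrix P.

    P is a row-stochastic matrix given as a list of rows; blocks is a
    partition of the state indices (the level sets of the aggregating map).
    """
    for target in blocks:
        for source in blocks:
            # Probability mass each state of `source` sends into `target`.
            masses = [sum(P[i][j] for j in target) for i in source]
            if max(masses) - min(masses) > tol:
                return False
    return True


blocks = [[0], [1, 2]]  # aggregate states 1 and 2 into one observed symbol
P_ok = [[0.5, 0.25, 0.25], [0.2, 0.3, 0.5], [0.2, 0.6, 0.2]]
P_bad = [[0.5, 0.25, 0.25], [0.2, 0.3, 0.5], [0.4, 0.4, 0.2]]
print(is_strongly_lumpable(P_ok, blocks))   # True
print(is_strongly_lumpable(P_bad, blocks))  # False
```

    In `P_bad`, states 1 and 2 return to state 0 with different probabilities, so the observed process depends on which hidden state the block is in.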
    Adversarial Attacks on Cooperative Multi-agent Bandits. (arXiv:2311.01698v1 [cs.LG])
    Cooperative multi-agent multi-armed bandits (CMA2B) consider the collaborative efforts of multiple agents in a shared multi-armed bandit game. We study latent vulnerabilities exposed by this collaboration and consider adversarial attacks on a few agents with the goal of influencing the decisions of the rest. More specifically, we study adversarial attacks on CMA2B in both homogeneous settings, where agents operate with the same arm set, and heterogeneous settings, where agents have distinct arm sets. In the homogeneous setting, we propose attack strategies that, by targeting just one agent, convince all agents to select a particular target arm $T-o(T)$ times while incurring $o(T)$ attack costs in $T$ rounds. In the heterogeneous setting, we prove that a target arm attack requires linear attack costs and propose attack strategies that can force a maximum number of agents to suffer linear regrets while incurring sublinear costs and only manipulating the observations of a few target agents. Numerical experiments validate the effectiveness of our proposed attack strategies.
    Energy Efficiency Optimization for Subterranean LoRaWAN Using A Reinforcement Learning Approach: A Direct-to-Satellite Scenario. (arXiv:2311.01743v1 [cs.IT])
    The integration of subterranean LoRaWAN and non-terrestrial networks (NTN) delivers substantial economic and societal benefits in remote agriculture and disaster rescue operations. The LoRa modulation leverages quasi-orthogonal spreading factors (SFs) to optimize data rates, airtime, coverage and energy consumption. However, it is still challenging to effectively assign SFs to end devices for minimizing co-SF interference in massive subterranean LoRaWAN NTN. To address this, we investigate a reinforcement learning (RL)-based SF allocation scheme to optimize the system's energy efficiency (EE). To efficiently capture the device-to-environment interactions in dense networks, we propose an SF allocation technique using the multi-agent dueling double deep Q-network (MAD3QN) and the multi-agent advantage actor-critic (MAA2C) algorithms based on an analytical reward mechanism. Our proposed RL-based SF allocation approach achieves better performance than four benchmarks in the extreme underground direct-to-satellite scenario. Remarkably, MAD3QN shows promising potential to surpass MAA2C in terms of convergence rate and EE.
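    The trade-off between spreading factor, airtime and data rate mentioned above follows from the standard LoRa relations: a symbol lasts 2^SF / BW seconds and carries SF bits before coding. The sketch below evaluates those relations. The 125 kHz bandwidth and 4/5 coding rate are illustrative assumptions, not values taken from the abstract, and the function names are hypothetical.

```python
def lora_symbol_time(sf, bandwidth_hz=125_000):
    """Duration of one LoRa symbol in seconds: 2**SF / BW."""
    return (2 ** sf) / bandwidth_hz


def lora_bit_rate(sf, bandwidth_hz=125_000, coding_rate=4 / 5):
    """Nominal bit rate in bit/s: SF bits per symbol, scaled by the coding rate."""
    return sf * coding_rate / lora_symbol_time(sf, bandwidth_hz)


for sf in range(7, 13):
    print(sf, lora_symbol_time(sf), lora_bit_rate(sf))
```

    Each step up in SF doubles the symbol time, which extends range at the cost of longer airtime and therefore more transmit energy per packet.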
    Simplifying Transformer Blocks. (arXiv:2311.01906v1 [cs.LG])
    A simple design recipe for deep Transformers is to compose identical building blocks. But standard transformer blocks are far from simple, interweaving attention and MLP sub-blocks with skip connections and normalisation layers in precise arrangements. This complexity leads to brittle architectures, where seemingly minor changes can significantly reduce training speed, or render models untrainable. In this work, we ask to what extent the standard transformer block can be simplified. Combining signal propagation theory and empirical observations, we motivate modifications that allow many block components to be removed with no loss of training speed, including skip connections, projection or value parameters, sequential sub-blocks and normalisation layers. In experiments on both autoregressive decoder-only and BERT encoder-only models, our simplified transformers emulate the per-update training speed and performance of standard transformers, while enjoying 15% faster training throughput, and using 15% fewer parameters.
    Faithful and Robust Local Interpretability for Textual Predictions. (arXiv:2311.01605v1 [cs.CL])
    Interpretability is essential for machine learning models to be trusted and deployed in critical domains. However, existing methods for interpreting text models are often complex, lack solid mathematical foundations, and their performance is not guaranteed. In this paper, we propose FRED (Faithful and Robust Explainer for textual Documents), a novel method for interpreting predictions over text. FRED identifies key words in a document that significantly impact the prediction when removed. We establish the reliability of FRED through formal definitions and theoretical analyses on interpretable classifiers. Additionally, our empirical evaluation against state-of-the-art methods demonstrates the effectiveness of FRED in providing insights into text models.
    Physics-Informed Generator-Encoder Adversarial Networks with Latent Space Matching for Stochastic Differential Equations. (arXiv:2311.01708v1 [cs.LG])
    We propose a new class of physics-informed neural networks, called Physics-Informed Generator-Encoder Adversarial Networks, to effectively address the challenges posed by forward, inverse, and mixed problems in stochastic differential equations. In these scenarios, while the governing equations are known, the available data consist of only a limited set of snapshots for system parameters. Our model consists of two key components: the generator and the encoder, both updated alternately by gradient descent. In contrast to previous approaches of directly matching the approximated solutions with real snapshots, we employ an indirect matching that operates within the lower-dimensional latent feature space. This method circumvents challenges associated with high-dimensional inputs and complex data distributions, while yielding more accurate solutions compared to existing neural network solvers. In addition, the approach also mitigates the training instability issues encountered in previous adversarial frameworks in an efficient manner. Numerical results provide compelling evidence of the effectiveness of the proposed method in solving different types of stochastic differential equations.
    Patch-Based Deep Unsupervised Image Segmentation using Graph Cuts. (arXiv:2311.01475v1 [cs.CV])
    Unsupervised image segmentation aims at grouping different semantic patterns in an image without the use of human annotation. Similarly, image clustering searches for groupings of images based on their semantic content without supervision. Classically, both problems have captivated researchers as they drew from sound mathematical concepts to produce concrete applications. With the emergence of deep learning, the scientific community turned its attention to complex neural network-based solvers that achieved impressive results in those domains but rarely leveraged the advances made by classical methods. In this work, we propose a patch-based unsupervised image segmentation strategy that bridges advances in unsupervised feature extraction from deep clustering methods with the algorithmic help of classical graph-based methods. We show that a simple convolutional neural network, trained to classify image patches and iteratively regularized using graph cuts, naturally leads to a state-of-the-art fully-convolutional unsupervised pixel-level segmenter. Furthermore, we demonstrate that this is the ideal setting for leveraging the patch-level pairwise features generated by vision transformer models. Our results on real image data demonstrate the effectiveness of our proposed methodology.
    Invariant Causal Imitation Learning for Generalizable Policies. (arXiv:2311.01489v1 [stat.ML])
    Consider learning an imitation policy on the basis of demonstrated behavior from multiple environments, with an eye towards deployment in an unseen environment. Since the observable features from each setting may be different, directly learning individual policies as mappings from features to actions is prone to spurious correlations -- and may not generalize well. However, the expert's policy is often a function of a shared latent structure underlying those observable features that is invariant across settings. By leveraging data from multiple environments, we propose Invariant Causal Imitation Learning (ICIL), a novel technique in which we learn a feature representation that is invariant across domains, on the basis of which we learn an imitation policy that matches expert behavior. To cope with transition dynamics mismatch, ICIL learns a shared representation of causal features (for all training environments), that is disentangled from the specific representations of noise variables (for each of those environments). Moreover, to ensure that the learned policy matches the observation distribution of the expert's policy, ICIL estimates the energy of the expert's observations and uses a regularization term that minimizes the imitator policy's next state energy. Experimentally, we compare our method against several benchmarks in control and healthcare tasks and show its effectiveness in learning imitation policies capable of generalizing to unseen environments.
    Detecting Spurious Correlations via Robust Visual Concepts in Real and AI-Generated Image Classification. (arXiv:2311.01655v1 [cs.LG])
    Often machine learning models tend to automatically learn associations present in the training data without questioning their validity or appropriateness. This undesirable property is the root cause of the manifestation of spurious correlations, which render models unreliable and prone to failure in the presence of distribution shifts. Research shows that most methods attempting to remedy spurious correlations are only effective for a model's known spurious associations. Current spurious correlation detection algorithms either rely on extensive human annotations or are too restrictive in their formulation. Moreover, they rely on strict definitions of visual artifacts that may not apply to data produced by generative models, as they are known to hallucinate contents that do not conform to standard specifications. In this work, we introduce a general-purpose method that efficiently detects potential spurious correlations, and requires significantly less human interference in comparison to the prior art. Additionally, the proposed method provides intuitive explanations while eliminating the need for pixel-level annotations. We demonstrate the proposed method's tolerance to the peculiarity of AI-generated images, which is a considerably challenging task, one where most of the existing methods fall short. Consequently, our method is also suitable for detecting spurious correlations that may propagate to downstream applications originating from generative models.
    Creating Trustworthy LLMs: Dealing with Hallucinations in Healthcare AI. (arXiv:2311.01463v1 [cs.CL])
    Large language models have proliferated across multiple domains in a short period of time. There is, however, hesitation in the medical and healthcare domain towards their adoption because of issues like factuality, coherence, and hallucinations. Given the high-stakes nature of healthcare, many researchers have even cautioned against their usage until these issues are resolved. The key to the implementation and deployment of LLMs in healthcare is to make these models trustworthy, transparent (as much as possible) and explainable. In this paper we describe the key elements in creating reliable, trustworthy, and unbiased models as a necessary condition for their adoption in healthcare. Specifically, we focus on the quantification, validation, and mitigation of hallucinations in the context of healthcare. Lastly, we discuss what the future of LLMs in healthcare may look like.
    Leveraging Language Models to Detect Greenwashing. (arXiv:2311.01469v1 [cs.CL])
    In recent years, climate change repercussions have increasingly captured public interest. Consequently, corporations are emphasizing their environmental efforts in sustainability reports to bolster their public image. Yet, the absence of stringent regulations in the review of such reports allows potential greenwashing. In this study, we introduce a novel methodology to train a language model on generated labels for greenwashing risk. Our primary contributions encompass: developing a mathematical formulation to quantify greenwashing risk, a fine-tuned ClimateBERT model for this problem, and a comparative analysis of results. On a test set comprising sustainability reports, our best model achieved an average accuracy score of 86.34% and F1 score of 0.67, demonstrating that our methods show a promising direction of exploration for this task.
    Look-Ahead Selective Plasticity for Continual Learning of Visual Tasks. (arXiv:2311.01617v1 [cs.CV])
    Contrastive representation learning has emerged as a promising technique for continual learning as it can learn representations that are robust to catastrophic forgetting and generalize well to unseen future tasks. Previous work in continual learning has addressed forgetting by using previous task data and trained models. Inspired by event models created and updated in the brain, we propose a new mechanism that takes place during task boundaries, i.e., when one task finishes and another starts. By observing the redundancy-inducing ability of contrastive loss on the output of a neural network, our method leverages the first few samples of the new task to identify and retain parameters contributing most to the transfer ability of the neural network, freeing up the remaining parts of the network to learn new features. We evaluate the proposed methods on benchmark computer vision datasets including CIFAR10 and TinyImagenet and demonstrate state-of-the-art performance in the task-incremental, class-incremental, and domain-incremental continual learning scenarios.
    Epidemic Decision-making System Based Federated Reinforcement Learning. (arXiv:2311.01749v1 [cs.LG])
    Epidemic decision-making can effectively help the government to comprehensively consider public security and economic development when responding to public health and safety emergencies. Some studies have shown that reinforcement learning can effectively help the government make epidemic decisions, thus achieving a balance between health security and economic development. However, epidemic data often has the characteristics of limited samples and high privacy. This model can combine the epidemic situation data of various provinces for cooperative training to use as a reinforcement learning model for epidemic situation decisions, while protecting the privacy of the data. The experiments show that federated reinforcement learning obtains better performance and return than reinforcement learning alone, and that it also accelerates the training convergence of the model. At the same time, the experimental comparison shows that A2C is the most suitable reinforcement learning model for the epidemic decision-making scenario, followed by the PPO model, while the performance of DDPG is unsatisfactory.
    Divergent Token Metrics: Measuring degradation to prune away LLM components -- and optimize quantization. (arXiv:2311.01544v1 [cs.CL])
    Large Language Models (LLMs) have reshaped natural language processing with their impressive capabilities. Their ever-increasing size, however, has raised concerns about their effective deployment and the need for LLM compression. This study introduces the Divergent Token Metrics (DTMs), a novel approach for assessing compressed LLMs, addressing the limitations of traditional measures like perplexity that fail to accurately reflect text generation quality. DTMs focus on token divergence, providing deeper insights into the subtleties of model compression. Our results indicate that significant levels of precision and sparsity can be achieved without compromising text generation quality. Moreover, DTMs offer a more precise evaluation of each component's impact individually. Utilizing the First Divergent Token metric (FDTM) in model sparsification reveals that nearly 20% of all components can be pruned over 90%. In terms of quantization, the FDTM suggests that over 80% of parameters can be straightforwardly transformed to int8 without special outlier management.
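    Going by its name, a first-divergent-token measure can be illustrated as the position at which the token sequence generated by a compressed model first departs from that of the original model for the same prompt. The sketch below implements that reading. It is an assumption drawn from the abstract alone, the function name is hypothetical, and the paper's exact definition may differ.

```python
def first_divergent_token(reference_ids, candidate_ids):
    """Index of the first position where two token-id sequences differ.

    If one sequence is a prefix of the other (or they are identical),
    the length of the shorter sequence is returned.
    """
    for i, (ref, cand) in enumerate(zip(reference_ids, candidate_ids)):
        if ref != cand:
            return i
    return min(len(reference_ids), len(candidate_ids))


original = [5, 7, 9, 2]    # tokens generated by the uncompressed model
compressed = [5, 7, 4, 2]  # tokens generated after pruning or quantization
print(first_divergent_token(original, compressed))  # 2
```

    A larger value means the compressed model tracks the original for longer, which is the kind of generation-level signal that an averaged score such as perplexity can hide.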
    Adversarial Examples in the Physical World: A Survey. (arXiv:2311.01473v1 [cs.CV])
    Deep neural networks (DNNs) have demonstrated high vulnerability to adversarial examples. Besides the attacks in the digital world, the practical implications of adversarial examples in the physical world present significant challenges and safety concerns. However, current research on physical adversarial examples (PAEs) lacks a comprehensive understanding of their unique characteristics, leading to limited significance and understanding. In this paper, we address this gap by thoroughly examining the characteristics of PAEs within a practical workflow encompassing training, manufacturing, and re-sampling processes. By analyzing the links between physical adversarial attacks, we identify manufacturing and re-sampling as the primary sources of distinct attributes and particularities in PAEs. Leveraging this knowledge, we develop a comprehensive analysis and classification framework for PAEs based on their specific characteristics, covering over 100 studies on physical-world adversarial examples. Furthermore, we investigate defense strategies against PAEs and identify open challenges and opportunities for future research. We aim to provide a fresh, thorough, and systematic understanding of PAEs, thereby promoting the development of robust adversarial learning and its application in open-world scenarios.
    E(2) Equivariant Neural Networks for Robust Galaxy Morphology Classification. (arXiv:2311.01500v1 [astro-ph.GA])
    We propose the use of group convolutional neural network architectures (GCNNs) equivariant to the 2D Euclidean group, $E(2)$, for the task of galaxy morphology classification by utilizing symmetries of the data present in galaxy images as an inductive bias in the architecture. We conduct robustness studies by introducing artificial perturbations via Poisson noise insertion and one-pixel adversarial attacks to simulate the effects of limited observational capabilities. We train, validate, and test GCNNs equivariant to discrete subgroups of $E(2)$ - the cyclic and dihedral groups of order $N$ - on the Galaxy10 DECals dataset and find that GCNNs achieve higher classification accuracy and are consistently more robust than their non-equivariant counterparts, with an architecture equivariant to the group $D_{16}$ achieving a $95.52 \pm 0.18\%$ test-set accuracy. We also find that the model loses $<6\%$ accuracy on a $50\%$-noise dataset and all GCNNs are less susceptible to one-pixel perturbations than an identically constructed CNN. Our code is publicly available at https://github.com/snehjp2/GCNNMorphology.
    SortNet: Learning To Rank By a Neural-Based Sorting Algorithm. (arXiv:2311.01864v1 [cs.LG])
    The problem of relevance ranking consists of sorting a set of objects with respect to a given criterion. Since users may prefer different relevance criteria, the ranking algorithms should be adaptable to the user needs. Two main approaches exist in literature for the task of learning to rank: 1) a score function, learned by examples, which evaluates the properties of each object yielding an absolute relevance value that can be used to order the objects or 2) a pairwise approach, where a "preference function" is learned using pairs of objects to define which one has to be ranked first. In this paper, we present SortNet, an adaptive ranking algorithm which orders objects using a neural network as a comparator. The neural network training set provides examples of the desired ordering between pairs of items and it is constructed by an iterative procedure which, at each iteration, adds the most informative training examples. Moreover, the comparator adopts a connectionist architecture that is particularly suited for implementing a preference function. We also prove that such an architecture has the universal approximation property and can implement a wide class of functions. Finally, the proposed algorithm is evaluated on the LETOR dataset showing promising performances in comparison with other state of the art algorithms.
    Flexible Error Mitigation of Quantum Processes with Data Augmentation Empowered Neural Model. (arXiv:2311.01727v1 [quant-ph])
    Neural networks have shown their effectiveness in various tasks in the realm of quantum computing. However, their application in quantum error mitigation, a crucial step towards realizing practical quantum advancements, has been restricted by reliance on noise-free statistics. To tackle this critical challenge, we propose a data augmentation empowered neural model for error mitigation (DAEM). Our model does not require any prior knowledge about the specific noise type and measurement settings and can estimate noise-free statistics solely from the noisy measurement results of the target quantum process, rendering it highly suitable for practical implementation. In numerical experiments, we show the model's superior performance in mitigating various types of noise, including Markovian noise and Non-Markovian noise, compared with previous error mitigation methods. We further demonstrate its versatility by employing the model to mitigate errors in diverse types of quantum processes, including those involving large-scale quantum systems and continuous-variable quantum states. This powerful data augmentation-empowered neural model for error mitigation establishes a solid foundation for realizing more reliable and robust quantum technologies in practical applications.
    Calibrate and Boost Logical Expressiveness of GNN Over Multi-Relational and Temporal Graphs. (arXiv:2311.01647v1 [cs.LG])
    As a powerful framework for graph representation learning, Graph Neural Networks (GNNs) have garnered significant attention in recent years. However, to the best of our knowledge, there has been no formal analysis of the logical expressiveness of GNNs as Boolean node classifiers over multi-relational graphs, where each edge carries a specific relation type. In this paper, we investigate $\mathcal{FOC}_2$, a fragment of first-order logic with two variables and counting quantifiers. On the negative side, we demonstrate that the R$^2$-GNN architecture, which extends the local message passing GNN by incorporating global readout, fails to capture $\mathcal{FOC}_2$ classifiers in the general case. Nevertheless, on the positive side, we establish that R$^2$-GNN models are equivalent to $\mathcal{FOC}_2$ classifiers under certain restricted yet reasonable scenarios. To address the limitations of R$^2$-GNNs regarding expressiveness, we propose a simple graph transformation technique, akin to a preprocessing step, which can be executed in linear time. This transformation enables R$^2$-GNNs to effectively capture any $\mathcal{FOC}_2$ classifiers when applied to the "transformed" input graph. Moreover, we extend our analysis of expressiveness and graph transformation to temporal graphs, exploring several temporal GNN architectures and providing an expressiveness hierarchy for them. To validate our findings, we implement R$^2$-GNNs and the graph transformation technique and conduct empirical tests in node classification tasks against various well-known GNN architectures that support multi-relational or temporal graphs. Our experimental results consistently demonstrate that R$^2$-GNN with the graph transformation outperforms the baseline methods on both synthetic and real-world datasets.
    Domain Adaptive Graph Neural Networks for Constraining Cosmological Parameters Across Multiple Data Sets. (arXiv:2311.01588v1 [astro-ph.CO])
    Deep learning models have been shown to outperform methods that rely on summary statistics, like the power spectrum, in extracting information from complex cosmological data sets. However, due to differences in the subgrid physics implementation and numerical approximations across different simulation suites, models trained on data from one cosmological simulation show a drop in performance when tested on another. Similarly, models trained on any of the simulations would also likely experience a drop in performance when applied to observational data. Training on data from two different suites of the CAMELS hydrodynamic cosmological simulations, we examine the generalization capabilities of Domain Adaptive Graph Neural Networks (DA-GNNs). By utilizing GNNs, we capitalize on their capacity to capture structured scale-free cosmological information from galaxy distributions. Moreover, by including unsupervised domain adaptation via Maximum Mean Discrepancy (MMD), we enable our models to extract domain-invariant features. We demonstrate that DA-GNN achieves higher accuracy and robustness on cross-dataset tasks (up to $28\%$ better relative error and up to almost an order of magnitude better $\chi^2$). Using data visualizations, we show the effects of domain adaptation on proper latent space data alignment. This shows that DA-GNNs are a promising method for extracting domain-independent cosmological information, a vital step toward robust deep learning for real cosmic survey data.
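As a rough illustration of the Maximum Mean Discrepancy term mentioned above, the following sketch computes a biased (V-statistic) estimate of squared MMD between two feature batches with an RBF kernel. The kernel choice, the `bandwidth` parameter, and the function names are assumptions made for illustration; the abstract does not specify how the authors configure MMD.

```python
import numpy as np


def rbf_kernel(a: np.ndarray, b: np.ndarray, bandwidth: float = 1.0) -> np.ndarray:
    """Pairwise RBF kernel values between the rows of a and the rows of b."""
    sq_dist = (
        np.sum(a ** 2, axis=1)[:, None]
        + np.sum(b ** 2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * bandwidth ** 2))


def mmd_squared(x: np.ndarray, y: np.ndarray, bandwidth: float = 1.0) -> float:
    """Biased estimate of squared MMD between samples x and y."""
    k_xx = rbf_kernel(x, x, bandwidth).mean()
    k_yy = rbf_kernel(y, y, bandwidth).mean()
    k_xy = rbf_kernel(x, y, bandwidth).mean()
    return float(k_xx + k_yy - 2.0 * k_xy)


rng = np.random.default_rng(0)
source = rng.standard_normal((200, 2))          # stand-in for features from one simulation suite
same = rng.standard_normal((200, 2))            # same distribution as source
shifted = rng.standard_normal((200, 2)) + 3.0   # stand-in for a shifted domain
print(mmd_squared(source, same), mmd_squared(source, shifted))
```

Adding such a term to the training loss penalizes latent features whose distribution differs between domains, which is the general mechanism behind unsupervised domain adaptation via MMD.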
    Amide Proton Transfer (APT) imaging in tumor with a machine learning approach using partially synthetic data. (arXiv:2311.01683v1 [physics.med-ph])
    Machine learning (ML) has been increasingly used to quantify the chemical exchange saturation transfer (CEST) effect. ML models are typically trained using either measured data or fully simulated data. However, training with measured data often lacks sufficient training data, while training with fully simulated data may introduce bias due to limited simulation pools. This study introduces a new platform that combines simulated and measured components to generate partially synthetic CEST data, and evaluates its feasibility for training ML models to predict the amide proton transfer (APT) effect. Partially synthetic CEST signals were created using an inverse summation of APT effects from simulations and the other components from measurements. Training data were generated by varying APT simulation parameters and applying scaling factors to adjust the measured components, achieving a balance between simulation flexibility and fidelity. First, tissue-mimicking CEST signals along with ground truth information were created using multiple-pool model simulations to validate this method. Second, an ML model was trained individually on partially synthetic data, in vivo data, and fully simulated data, to predict the APT effect in rat brains bearing 9L tumors. Experiments on tissue-mimicking data suggest that the ML method using the partially synthetic data is accurate in predicting APT. In vivo experiments suggest that our method provides more accurate and robust prediction than training using in vivo data and fully synthetic data. Partially synthetic CEST data can address the challenges in conventional ML methods.
    SemiGPC: Distribution-Aware Label Refinement for Imbalanced Semi-Supervised Learning Using Gaussian Processes. (arXiv:2311.01646v1 [cs.CV])
    In this paper we introduce SemiGPC, a distribution-aware label refinement strategy based on Gaussian Processes where the predictions of the model are derived from the label posterior distribution. Unlike other buffer-based semi-supervised methods such as CoMatch and SimMatch, our SemiGPC includes a normalization term that addresses imbalances in the global data distribution while maintaining local sensitivity. This explicit control allows SemiGPC to be more robust to confirmation bias, especially under class imbalance. We show that SemiGPC improves performance when paired with different Semi-Supervised methods such as FixMatch, ReMixMatch, SimMatch and FreeMatch and different pre-training strategies including MSN and Dino. We also show that SemiGPC achieves state-of-the-art results under different degrees of class imbalance on standard CIFAR10-LT/CIFAR100-LT, especially in the low-data regime. Using SemiGPC also results in about a 2% average accuracy increase compared to a new competitive baseline on the more challenging benchmarks SemiAves, SemiCUB, SemiFungi and Semi-iNat.
    ATGNN: Audio Tagging Graph Neural Network. (arXiv:2311.01526v1 [cs.SD])
    Deep learning models such as CNNs and Transformers have achieved impressive performance for end-to-end audio tagging. Recent works have shown that despite stacking multiple layers, the receptive field of CNNs remains severely limited. Transformers on the other hand are able to map global context through self-attention, but treat the spectrogram as a sequence of patches which is not flexible enough to capture irregular audio objects. In this work, we treat the spectrogram in a more flexible way by considering it as a graph structure and processing it with a novel graph neural architecture called ATGNN. ATGNN not only combines the capability of CNNs with the global information sharing ability of Graph Neural Networks, but also maps semantic relationships between learnable class embeddings and corresponding spectrogram regions. We evaluate ATGNN on two audio tagging tasks, where it achieves 0.585 mAP on the FSD50K dataset and 0.335 mAP on the AudioSet-balanced dataset, achieving comparable results to Transformer based models with a significantly lower number of learnable parameters.
    Investigating the Behavior of Diffusion Models for Accelerating Electronic Structure Calculations. (arXiv:2311.01491v1 [physics.chem-ph])
    We present an investigation into diffusion models for molecular generation, with the aim of better understanding how their predictions compare to the results of physics-based calculations. The investigation into these models is driven by their potential to significantly accelerate electronic structure calculations using machine learning, without requiring expensive first-principles datasets for training interatomic potentials. We find that the inference process of a popular diffusion model for de novo molecular generation is divided into an exploration phase, where the model chooses the atomic species, and a relaxation phase, where it adjusts the atomic coordinates to find a low-energy geometry. As training proceeds, we show that the model initially learns about the first-order structure of the potential energy surface, and then later learns about higher-order structure. We also find that the relaxation phase of the diffusion model can be re-purposed to sample the Boltzmann distribution over conformations and to carry out structure relaxations. For structure relaxations, the model finds geometries with ~10x lower energy than those produced by a classical force field for small organic molecules. Initializing a density functional theory (DFT) relaxation at the diffusion-produced structures yields a >2x speedup to the DFT relaxation when compared to initializing at structures relaxed with a classical force field.
    An Efficient Detection and Control System for Underwater Docking using Machine Learning and Realistic Simulation: A Comprehensive Approach. (arXiv:2311.01522v1 [cs.RO])
    Underwater docking is critical to enable the persistent operation of Autonomous Underwater Vehicles (AUVs). For this, the AUV must be capable of detecting and localizing the docking station, which is complex due to the highly dynamic undersea environment. Image-based solutions offer a high acquisition rate and versatile alternative to adapt to this environment; however, the underwater environment presents challenges such as low visibility, high turbidity, and distortion. In addition to this, field experiments to validate underwater docking capabilities can be costly and dangerous due to the specialized equipment and safety considerations required to conduct the experiments. This work compares different deep-learning architectures to perform underwater docking detection and classification. The architecture with the best performance is then compressed using knowledge distillation under the teacher-student paradigm to reduce the network's memory footprint, allowing real-time implementation. To reduce the simulation-to-reality gap, a Generative Adversarial Network (GAN) is used to do image-to-image translation, converting the Gazebo simulation image into a realistic underwater-looking image. The obtained image is then processed using an underwater image formation model to simulate image attenuation over distance under different water types. The proposed method is finally evaluated according to the AUV docking success rate and compared with classical vision methods. The simulation results show an improvement of 20% in the high turbidity scenarios regardless of the underwater currents. Furthermore, we show the performance of the proposed approach by showing experimental results on the off-the-shelf AUV Iver3.
    Local Borsuk-Ulam, Stability, and Replicability. (arXiv:2311.01599v1 [cs.LG])
    We use and adapt the Borsuk-Ulam Theorem from topology to derive limitations on list-replicable and globally stable learning algorithms. We further demonstrate the applicability of our methods in combinatorics and topology. We show that, besides trivial cases, both list-replicable and globally stable learning are impossible in the agnostic PAC setting. This is in contrast with the realizable case where it is known that any class with a finite Littlestone dimension can be learned by such algorithms. In the realizable PAC setting, we sharpen previous impossibility results and broaden their scope. Specifically, we establish optimal bounds for list replicability and global stability numbers in finite classes. This provides an exponential improvement over previous works and implies an exponential separation from the Littlestone dimension. We further introduce lower bounds for weak learners, i.e., learners that are only marginally better than random guessing. Lower bounds from previous works apply only to stronger learners. To offer a broader and more comprehensive view of our topological approach, we prove a local variant of the Borsuk-Ulam theorem in topology and a result in combinatorics concerning Kneser colorings. In combinatorics, we prove that if $c$ is a coloring of all non-empty subsets of $[n]$ such that disjoint sets have different colors, then there is a chain of subsets that receives at least $1+ \lfloor n/2\rfloor$ colors (this bound is sharp). In topology, we prove e.g. that for any open antipodal-free cover of the $d$-dimensional sphere, there is a point $x$ that belongs to at least $t=\lceil\frac{d+3}{2}\rceil$ sets.
    High-performance real-world optical computing trained by in situ model-free optimization. (arXiv:2307.11957v3 [physics.optics] UPDATED)
    Optical computing systems can provide high-speed and low-energy data processing but suffer from computationally demanding training and a simulation-to-reality gap. We propose a model-free solution for lightweight in situ optimization of optical computing systems based on the score gradient estimation algorithm. This approach treats the system as a black box and back-propagates loss directly to the optical weights' probabilistic distributions, hence circumventing the need for computation-heavy and biased system simulation. We demonstrate a superior classification accuracy on the MNIST and FMNIST datasets through experiments on a single-layer diffractive optical computing system. Furthermore, we show its potential for image-free and high-speed cell analysis. The inherent simplicity of our proposed method, combined with its low demand for computational resources, expedites the transition of optical computing from laboratory demonstrations to real-world applications.
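The score gradient estimator named above can be sketched on a toy problem. This is not the authors' optical setup: `black_box_loss` is a hypothetical stand-in for a system that can only be evaluated, the weights are drawn from a Gaussian with learnable mean `mu` and fixed standard deviation `sigma`, and all hyperparameters are illustrative assumptions.

```python
import numpy as np


def black_box_loss(w: np.ndarray) -> np.ndarray:
    """Hypothetical system response: only evaluations are available, no gradients."""
    target = np.array([1.0, -2.0])
    return np.sum((w - target) ** 2, axis=-1)


def score_gradient(mu: np.ndarray, sigma: float, n_samples: int,
                   rng: np.random.Generator) -> np.ndarray:
    """Estimate d E[loss] / d mu for weights w ~ N(mu, sigma^2 I)."""
    eps = rng.standard_normal((n_samples, mu.size))
    losses = black_box_loss(mu + sigma * eps)
    # Subtracting the batch mean is a simple baseline that reduces variance.
    advantage = losses - losses.mean()
    # The score of the Gaussian with respect to mu is (w - mu) / sigma^2 = eps / sigma.
    return (advantage[:, None] * eps / sigma).mean(axis=0)


def optimize(steps: int = 200, lr: float = 0.05, sigma: float = 0.1,
             n_samples: int = 64, seed: int = 0) -> np.ndarray:
    """Gradient descent on mu using only black-box loss evaluations."""
    rng = np.random.default_rng(seed)
    mu = np.zeros(2)
    for _ in range(steps):
        mu = mu - lr * score_gradient(mu, sigma, n_samples, rng)
    return mu


final_mu = optimize()
print(final_mu, black_box_loss(final_mu))
```

The loss is never differentiated; only sampled weights and their measured losses are used, which is what allows this family of estimators to be applied to a physical system treated as a black box.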
    From SMOTE to Mixup for Deep Imbalanced Classification. (arXiv:2308.15457v2 [cs.LG] UPDATED)
    Given imbalanced data, it is hard to train a good classifier using deep learning because of the poor generalization of minority classes. Traditionally, the well-known synthetic minority oversampling technique (SMOTE) for data augmentation, a data mining approach for imbalanced learning, has been used to improve this generalization. However, it is unclear whether SMOTE also benefits deep learning. In this work, we study why the original SMOTE is insufficient for deep learning, and enhance SMOTE using soft labels. Connecting the resulting soft SMOTE with Mixup, a modern data augmentation technique, leads to a unified framework that puts traditional and modern data augmentation techniques under the same umbrella. A careful study within this framework shows that Mixup improves generalization by implicitly achieving uneven margins between majority and minority classes. We then propose a novel margin-aware Mixup technique that more explicitly achieves uneven margins. Extensive experimental results demonstrate that our proposed technique yields state-of-the-art performance on deep imbalanced classification while achieving superior performance on extremely imbalanced data. The code is open-sourced in our developed package https://github.com/ntucllab/imbalanced-DL to foster future research in this direction.
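The link between interpolation-based oversampling and Mixup can be made concrete with a short sketch. It shows plain Mixup on one pair of samples with one-hot labels; the function name and example values are illustrative, and the margin-aware variant proposed in the paper is not reproduced here.

```python
import numpy as np


def mixup_pair(x_a, y_a, x_b, y_b, lam):
    """Convex combination of two samples and of their one-hot labels."""
    x_mixed = lam * x_a + (1.0 - lam) * x_b
    y_mixed = lam * y_a + (1.0 - lam) * y_b  # soft label
    return x_mixed, y_mixed


x_major = np.array([0.0, 0.0])
y_major = np.array([1.0, 0.0])   # one-hot label of the majority class
x_minor = np.array([2.0, 4.0])
y_minor = np.array([0.0, 1.0])   # one-hot label of the minority class

x_new, y_new = mixup_pair(x_major, y_major, x_minor, y_minor, lam=0.75)
print(x_new, y_new)  # [0.5 1. ] [0.75 0.25]
```

In Mixup, `lam` is usually drawn from a Beta distribution. The original SMOTE interpolates only between a minority sample and its same-class neighbours, so the label stays hard; interpolating the labels as well is what produces the soft labels discussed above.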
    Landscape Surrogate: Learning Decision Losses for Mathematical Optimization Under Partial Information. (arXiv:2307.08964v2 [cs.LG] UPDATED)
    Recent works in learning-integrated optimization have shown promise in settings where the optimization problem is only partially observed or where general-purpose optimizers perform poorly without expert tuning. By learning an optimizer $\mathbf{g}$ to tackle these challenging problems with $f$ as the objective, the optimization process can be substantially accelerated by leveraging past experience. The optimizer can be trained with supervision from known optimal solutions or implicitly by optimizing the compound function $f\circ \mathbf{g}$. The implicit approach may not require optimal solutions as labels and is capable of handling problem uncertainty; however, it is slow to train and deploy due to frequent calls to optimizer $\mathbf{g}$ during both training and testing. The training is further challenged by sparse gradients of $\mathbf{g}$, especially for combinatorial solvers. To address these challenges, we propose using a smooth and learnable Landscape Surrogate $M$ as a replacement for $f\circ \mathbf{g}$. This surrogate, learnable by neural networks, can be computed faster than the solver $\mathbf{g}$, provides dense and smooth gradients during training, can generalize to unseen optimization problems, and is efficiently learned via alternating optimization. We test our approach on both synthetic problems, including shortest path and multidimensional knapsack, and real-world problems such as portfolio optimization, achieving comparable or superior objective values compared to state-of-the-art baselines while reducing the number of calls to $\mathbf{g}$. Notably, our approach outperforms existing methods for computationally expensive high-dimensional problems.
    Adaptive Algorithms for Relaxed Pareto Set Identification. (arXiv:2307.00424v2 [stat.ML] UPDATED)
    In this paper we revisit the fixed-confidence identification of the Pareto optimal set in a multi-objective multi-armed bandit model. As the sample complexity to identify the exact Pareto set can be very large, a relaxation that allows some additional near-optimal arms to be output has been studied. In this work we also tackle alternative relaxations that instead allow identifying a relevant subset of the Pareto set. Notably, we propose a single sampling strategy, called Adaptive Pareto Exploration, that can be used in conjunction with different stopping rules to take into account different relaxations of the Pareto Set Identification problem. We analyze the sample complexity of these different combinations, quantifying in particular the reduction in sample complexity that occurs when one seeks to identify at most $k$ Pareto optimal arms. We showcase the good practical performance of Adaptive Pareto Exploration on a real-world scenario, in which we adaptively explore several vaccination strategies against Covid-19 in order to find the optimal ones when multiple immunogenicity criteria are taken into account.
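To make the target of the identification problem concrete, here is a minimal sketch that computes the Pareto set when the mean vector of every arm is known exactly. In the bandit setting these means must be estimated from samples, which this sketch does not attempt. The dominance convention used (at least as good in every objective, strictly better in one) is a common one and an assumption here; the paper's precise definition may differ.

```python
import numpy as np


def pareto_set(means) -> list:
    """Indices of arms whose mean vector is not dominated by any other arm.

    Arm j dominates arm i when it is at least as good in every objective
    and strictly better in at least one (larger values are better).
    """
    m = np.asarray(means, dtype=float)
    optimal = []
    for i in range(len(m)):
        dominated = any(
            np.all(m[j] >= m[i]) and np.any(m[j] > m[i])
            for j in range(len(m))
            if j != i
        )
        if not dominated:
            optimal.append(i)
    return optimal


# Hypothetical two-objective means for five arms.
arm_means = [[1, 5], [2, 4], [1, 3], [3, 1], [2, 2]]
print(pareto_set(arm_means))  # [0, 1, 3]
```

The relaxations described above change what counts as an acceptable output, for example by allowing near-optimal arms or by stopping once at most $k$ Pareto optimal arms are found.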
    Deconstructing Data Reconstruction: Multiclass, Weight Decay and General Losses. (arXiv:2307.01827v2 [cs.LG] UPDATED)
    Memorization of training data is an active research area, yet our understanding of the inner workings of neural networks is still in its infancy. Recently, Haim et al. (2022) proposed a scheme to reconstruct training samples from multilayer perceptron binary classifiers, effectively demonstrating that a large portion of training samples are encoded in the parameters of such networks. In this work, we extend their findings in several directions, including reconstruction from multiclass and convolutional neural networks. We derive a more general reconstruction scheme which is applicable to a wider range of loss functions such as regression losses. Moreover, we study the various factors that contribute to networks' susceptibility to such reconstruction schemes. Intriguingly, we observe that using weight decay during training increases reconstructability both in terms of quantity and quality. Additionally, we examine the influence of the number of neurons relative to the number of training samples on the reconstructability. Code: https://github.com/gonbuzaglo/decoreco
    MARRS: Multimodal Reference Resolution System. (arXiv:2311.01650v1 [cs.CL])
    Successfully handling context is essential for any dialog understanding task. This context may be conversational (relying on previous user queries or system responses), visual (relying on what the user sees, for example, on their screen), or background (based on signals such as a ringing alarm or playing music). In this work, we present an overview of MARRS, or Multimodal Reference Resolution System, an on-device framework within a Natural Language Understanding system, responsible for handling conversational, visual and background context. In particular, we present different machine learning models to enable handling contextual queries; specifically, one to enable reference resolution, and one to handle context via query rewriting. We also describe how these models complement each other to form a unified, coherent, lightweight system that can understand context while preserving user privacy.
    An Alternative to Variance: Gini Deviation for Risk-averse Policy Gradient. (arXiv:2307.08873v3 [cs.LG] UPDATED)
    Restricting the variance of a policy's return is a popular choice in risk-averse Reinforcement Learning (RL) due to its clear mathematical definition and easy interpretability. Traditional methods directly restrict the total return variance. Recent methods restrict the per-step reward variance as a proxy. We thoroughly examine the limitations of these variance-based methods, such as sensitivity to numerical scale and hindering of policy learning, and propose to use an alternative risk measure, Gini deviation, as a substitute. We study various properties of this new risk measure and derive a policy gradient algorithm to minimize it. Empirical evaluation in domains where risk-aversion can be clearly defined shows that our algorithm can mitigate the limitations of variance-based risk measures and achieves high return with low risk in terms of variance and Gini deviation when others fail to learn a reasonable policy.
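The sensitivity to numerical scale can be seen in a small sketch. It takes Gini deviation to be half the expected absolute difference between two independent copies of the return, $\frac{1}{2}\mathbb{E}|X_1 - X_2|$, and estimates it over all sample pairs (a simple biased estimator that includes the zero self-pairs). The sample returns are hypothetical, and the policy gradient algorithm from the paper is not reproduced.

```python
import numpy as np


def gini_deviation(samples) -> float:
    """Half the mean absolute difference over all pairs of samples."""
    s = np.asarray(samples, dtype=float)
    return float(0.5 * np.abs(s[:, None] - s[None, :]).mean())


returns = np.array([1.0, 2.0, 3.0, 6.0])   # hypothetical episode returns
scaled = 10.0 * returns                     # same returns on a 10x reward scale

print(np.var(returns), np.var(scaled))                   # 3.5 350.0
print(gini_deviation(returns), gini_deviation(scaled))   # 1.0 10.0
```

Multiplying the rewards by a constant $c$ multiplies the variance by $c^2$ but the Gini deviation only by $c$, so a penalty weight tuned at one reward scale is less distorted when the scale changes.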
    Improving Interpersonal Communication by Simulating Audiences with Language Models. (arXiv:2311.00687v2 [cs.AI] UPDATED)
    How do we communicate with others to achieve our goals? We use our prior experience or advice from others, or construct a candidate utterance by predicting how it will be received. However, our experiences are limited and biased, and reasoning about potential outcomes can be difficult and cognitively challenging. In this paper, we explore how we can leverage Large Language Model (LLM) simulations to help us communicate better. We propose the Explore-Generate-Simulate (EGS) framework, which takes as input any scenario where an individual is communicating to an audience with a goal they want to achieve. EGS (1) explores the solution space by producing a diverse set of advice relevant to the scenario, (2) generates communication candidates conditioned on subsets of the advice, and (3) simulates the reactions from various audiences to determine both the best candidate and advice to use. We evaluate the framework on eight scenarios spanning the ten fundamental processes of interpersonal communication. For each scenario, we collect a dataset of human evaluations across candidates and baselines, and showcase that our framework's chosen candidate is preferred over popular generation mechanisms including Chain-of-Thought. We also find that audience simulations achieve reasonably high agreement with human raters across 5 of the 8 scenarios. Finally, we demonstrate the generality of our framework by applying it to real-world scenarios described by users on web forums. Through evaluations and demonstrations, we show that EGS enhances the effectiveness and outcomes of goal-oriented communication across a variety of situations, thus opening up new possibilities for the application of large language models in revolutionizing communication and decision-making processes.
    Distill Gold from Massive Ores: Efficient Dataset Distillation via Critical Samples Selection. (arXiv:2305.18381v2 [cs.LG] UPDATED)
    Data-efficient learning has drawn significant attention, especially given the current trend of large multi-modal models, where dataset distillation can be an effective solution. However, the dataset distillation process itself is still very inefficient. In this work, we model the distillation problem with reference to information transport. Observing that severe data redundancy exists in dataset distillation, we argue to put more emphasis on the utility of the training samples. We propose a family of methods to exploit the most valuable samples, which is validated by our comprehensive analysis of the optimal data selection. The new strategy significantly reduces the training cost and extends a variety of existing distillation algorithms to larger and more diversified datasets, e.g., in some cases only 0.04% training data is sufficient for comparable distillation performance. Moreover, our strategy consistently enhances the performance, which may open up new analyses on the dynamics of distillation and networks. Our method is able to extend the distillation algorithms to much larger-scale datasets and more heterogeneous datasets, e.g., ImageNet-1K and Kinetics-400. Our code is available on https://github.com/silicx/GoldFromOres.
    MEDL-U: Uncertainty-aware 3D Automatic Annotation based on Evidential Deep Learning. (arXiv:2309.09599v2 [cs.CV] UPDATED)
    Advancements in deep learning-based 3D object detection necessitate the availability of large-scale datasets. However, this requirement introduces the challenge of manual annotation, which is often both burdensome and time-consuming. To tackle this issue, the literature has seen the emergence of several weakly supervised frameworks for 3D object detection which can automatically generate pseudo labels for unlabeled data. Nevertheless, these generated pseudo labels contain noise and are not as accurate as those labeled by humans. In this paper, we present the first approach that addresses the inherent ambiguities present in pseudo labels by introducing an Evidential Deep Learning (EDL) based uncertainty estimation framework. Specifically, we propose MEDL-U, an EDL framework based on MTrans, which not only generates pseudo labels but also quantifies the associated uncertainties. However, applying EDL to 3D object detection presents three primary challenges: (1) relatively lower pseudolabel quality in comparison to other autolabelers; (2) excessively high evidential uncertainty estimates; and (3) lack of clear interpretability and effective utilization of uncertainties for downstream tasks. We tackle these issues through the introduction of an uncertainty-aware IoU-based loss, an evidence-aware multi-task loss function, and the implementation of a post-processing stage for uncertainty refinement. Our experimental results demonstrate that probabilistic detectors trained using the outputs of MEDL-U surpass deterministic detectors trained using outputs from previous 3D annotators on the KITTI val set for all difficulty levels. Moreover, MEDL-U achieves state-of-the-art results on the KITTI official test set compared to existing 3D automatic annotators.
    Fractional Denoising for 3D Molecular Pre-training. (arXiv:2307.10683v2 [q-bio.QM] UPDATED)
    Coordinate denoising is a promising 3D molecular pre-training method, which has achieved remarkable performance in various downstream drug discovery tasks. Theoretically, the objective is equivalent to learning the force field, which has been shown to be helpful for downstream tasks. Nevertheless, there are two challenges for coordinate denoising to learn an effective force field, i.e., low-coverage samples and an isotropic force field. The underlying reason is that molecular distributions assumed by existing denoising methods fail to capture the anisotropic characteristic of molecules. To tackle these challenges, we propose a novel hybrid noise strategy, including noise on both dihedral angles and coordinates. However, denoising such hybrid noise in a traditional way is no longer equivalent to learning the force field. Through theoretical deductions, we find that the problem is caused by the dependence of the covariance on the input conformation. To this end, we propose to decouple the two types of noise and design a novel fractional denoising method (Frad), which only denoises the latter coordinate part. In this way, Frad enjoys both the merits of sampling more low-energy structures and the force field equivalence. Extensive experiments show the effectiveness of Frad in molecular representation, with a new state-of-the-art on 9 out of 12 tasks of QM9 and on 7 out of 8 targets of MD17.
    Universal Domain Adaptation from Foundation Models: A Baseline Study. (arXiv:2305.11092v2 [cs.LG] UPDATED)
    Foundation models (e.g., CLIP or DINOv2) have shown their impressive learning and transfer capabilities in a wide range of visual tasks, by training on a large corpus of data and adapting to specific downstream tasks. It is, however, interesting that foundation models have not been fully explored for universal domain adaptation (UniDA), which is to learn models using labeled data in a source domain and unlabeled data in a target one, such that the learned models can successfully adapt to the target data. In this paper, we make comprehensive empirical studies of state-of-the-art UniDA methods using foundation models. We first observe that, unlike fine-tuning from ImageNet pre-trained models, as previous methods do, fine-tuning from foundation models yields significantly poorer results, sometimes even worse than training from scratch. While freezing the backbones, we demonstrate that although the foundation models greatly improve the performance of the baseline method that trains the models on the source data alone, existing UniDA methods generally fail to improve over the baseline. This suggests that new research efforts are needed for UniDA using foundation models. Based on these findings, we introduce CLIP distillation, a parameter-free method specifically designed to distill target knowledge from CLIP models. The core of our CLIP distillation lies in a self-calibration technique for automatic temperature scaling, a feature that significantly enhances the baseline's out-class detection capability. Although simple, our method outperforms previous approaches in most benchmark tasks, excelling in evaluation metrics including H-score/H$^3$-score and the newly proposed universal classification rate (UCR) metric. We hope that our investigation and the proposed simple framework can serve as a strong baseline to facilitate future studies in this field.
    Adversarial Attacks against Binary Similarity Systems. (arXiv:2303.11143v2 [cs.CR] UPDATED)
    In recent years, binary analysis gained traction as a fundamental approach to inspect software and guarantee its security. Due to the exponential increase of devices running software, much research is now moving towards new autonomous solutions based on deep learning models, as they have been showing state-of-the-art performances in solving binary analysis problems. One of the hot topics in this context is binary similarity, which consists in determining if two functions in assembly code are compiled from the same source code. However, it is unclear how deep learning models for binary similarity behave in an adversarial context. In this paper, we study the resilience of binary similarity models against adversarial examples, showing that they are susceptible to both targeted and untargeted attacks (w.r.t. similarity goals) performed by black-box and white-box attackers. In more detail, we extensively test three current state-of-the-art solutions for binary similarity against two black-box greedy attacks, including a new technique that we call Spatial Greedy, and one white-box attack in which we repurpose a gradient-guided strategy used in attacks to image classifiers.
    Improving Lesion Segmentation in FDG-18 Whole-Body PET/CT scans using Multilabel approach: AutoPET II challenge. (arXiv:2311.01574v1 [eess.IV])
    Automatic segmentation of lesions in FDG-18 Whole Body (WB) PET/CT scans using deep learning models is instrumental for determining treatment response, optimizing dosimetry, and advancing theranostic applications in oncology. However, the presence of organs with elevated radiotracer uptake, such as the liver, spleen, brain, and bladder, often leads to challenges, as these regions are often misidentified as lesions by deep learning models. To address this issue, we propose a novel approach of segmenting both organs and lesions, aiming to enhance the performance of automatic lesion segmentation methods. In this study, we assessed the effectiveness of our proposed method using the AutoPET II challenge dataset, which comprises 1014 subjects. We evaluated the impact of inclusion of additional labels and data in the segmentation performance of the model. In addition to the expert-annotated lesion labels, we introduced eight additional labels for organs, including the liver, kidneys, urinary bladder, spleen, lung, brain, heart, and stomach. These labels were integrated into the dataset, and a 3D UNET model was trained within the nnUNet framework. Our results demonstrate that our method achieved the top ranking in the held-out test dataset, underscoring the potential of this approach to significantly improve lesion segmentation accuracy in FDG-18 Whole-Body PET/CT scans, ultimately benefiting cancer patients and advancing clinical practice.
    Hardness of Low Rank Approximation of Entrywise Transformed Matrix Products. (arXiv:2311.01960v1 [cs.DS])
    Inspired by fast algorithms in natural language processing, we study low rank approximation in the entrywise transformed setting where we want to find a good rank $k$ approximation to $f(U \cdot V)$, where $U, V^\top \in \mathbb{R}^{n \times r}$ are given, $r = O(\log(n))$, and $f(x)$ is a general scalar function. Previous work in sublinear low rank approximation has shown that if both (1) $U = V^\top$ and (2) $f(x)$ is a PSD kernel function, then there is an $O(nk^{\omega-1})$ time constant relative error approximation algorithm, where $\omega \approx 2.376$ is the exponent of matrix multiplication. We give the first conditional time hardness results for this problem, demonstrating that both conditions (1) and (2) are in fact necessary for getting better than $n^{2-o(1)}$ time for a relative error low rank approximation for a wide class of functions. We give novel reductions from the Strong Exponential Time Hypothesis (SETH) that rely on lower bounding the leverage scores of flat sparse vectors and hold even when the rank of the transformed matrix $f(UV)$ and the target rank are $n^{o(1)}$, and when $U = V^\top$. Furthermore, even when $f(x) = x^p$ is a simple polynomial, we give runtime lower bounds in the case when $U \neq V^\top$ of the form $\Omega(\min(n^{2-o(1)}, \Omega(2^p)))$. Lastly, we demonstrate that our lower bounds are tight by giving an $O(n \cdot \text{poly}(k, 2^p, 1/\epsilon))$ time relative error approximation algorithm and a fast $O(n \cdot \text{poly}(k, p, 1/\epsilon))$ additive error approximation using fast tensor-based sketching. Additionally, since our low rank algorithms rely on matrix-vector product subroutines, our lower bounds extend to show that computing $f(UV)W$, for even a small matrix $W$, requires $\Omega(n^{2-o(1)})$ time.
    Universal Sharpness Dynamics in Neural Network Training: Fixed Point Analysis, Edge of Stability, and Route to Chaos. (arXiv:2311.02076v1 [cs.LG])
    In gradient descent dynamics of neural networks, the top eigenvalue of the Hessian of the loss (sharpness) displays a variety of robust phenomena throughout training. This includes early time regimes where the sharpness may decrease during early periods of training (sharpness reduction), and later time behavior such as progressive sharpening and edge of stability. We demonstrate that a simple $2$-layer linear network (UV model) trained on a single training example exhibits all of the essential sharpness phenomenology observed in real-world scenarios. By analyzing the structure of dynamical fixed points in function space and the vector field of function updates, we uncover the underlying mechanisms behind these sharpness trends. Our analysis reveals (i) the mechanism behind early sharpness reduction and progressive sharpening, (ii) the required conditions for edge of stability, and (iii) a period-doubling route to chaos on the edge of stability manifold as learning rate is increased. Finally, we demonstrate that various predictions from this simplified model generalize to real-world scenarios and discuss its limitations.
    GateLoop: Fully Data-Controlled Linear Recurrence for Sequence Modeling. (arXiv:2311.01927v1 [cs.LG])
    Linear Recurrence has proven to be a powerful tool for modeling long sequences efficiently. In this work, we show that existing models fail to take full advantage of its potential. Motivated by this finding, we develop GateLoop, a foundational sequence model that generalizes linear recurrent models such as S4, S5, LRU and RetNet, by employing data-controlled state transitions. Utilizing this theoretical advance, GateLoop empirically outperforms existing models for auto-regressive language modeling. Our method comes with a low-cost $O(l)$ recurrent mode and an efficient $O(l \log_{2} l)$ parallel mode making use of highly optimized associative scan implementations. Furthermore, we derive an $O(l^2)$ surrogate attention mode, revealing remarkable implications for Transformer and recently proposed architectures. Specifically, we prove that our approach can be interpreted as providing data-controlled relative-positional information to Attention. While many existing models solely rely on data-controlled cumulative sums for context aggregation, our findings suggest that incorporating data-controlled complex cumulative products may be a crucial step towards more powerful sequence models.
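    The two modes mentioned in the abstract above can be illustrated with a minimal sketch. It assumes a real-valued, single-channel recurrence h_t = a_t * h_{t-1} + x_t in which the transition a_t is computed from the input itself; the sigmoid gate, the function names, and the `gate_weight` parameter are hypothetical choices for illustration and are not GateLoop's actual parameterization (which, per the abstract, involves complex cumulative products). The `combine` operator shows why a parallel associative scan is applicable: composing two affine maps h -> a*h + b is associative.

```python
import math


def sigmoid(z):
    """Logistic function, used here as a hypothetical data-controlled gate."""
    return 1.0 / (1.0 + math.exp(-z))


def gated_linear_recurrence(inputs, gate_weight=1.0):
    """Recurrent O(l) mode: h_t = a_t * h_{t-1} + x_t with h_0 = 0.

    The transition a_t = sigmoid(gate_weight * x_t) depends on the input,
    which is what "data-controlled" means in this sketch (an assumption).
    """
    h = 0.0
    states = []
    for x in inputs:
        a = sigmoid(gate_weight * x)
        h = a * h + x
        states.append(h)
    return states


def combine(left, right):
    """Compose two affine maps h -> a*h + b, applying `left` first.

    Because this operator is associative, the same states can be computed
    with a parallel prefix scan instead of a sequential loop.
    """
    a1, b1 = left
    a2, b2 = right
    return (a2 * a1, a2 * b1 + b2)
```

    Folding `combine` over the pairs (a_t, x_t) from left to right reproduces the final recurrent state; a parallel scan would apply the same operator in a tree-shaped order.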
    To Stay or Not to Stay in the Pre-train Basin: Insights on Ensembling in Transfer Learning. (arXiv:2303.03374v2 [cs.LG] UPDATED)
    Transfer learning and ensembling are two popular techniques for improving the performance and robustness of neural networks. Due to the high cost of pre-training, ensembles of models fine-tuned from a single pre-trained checkpoint are often used in practice. Such models end up in the same basin of the loss landscape, which we call the pre-train basin, and thus have limited diversity. In this work, we show that ensembles trained from a single pre-trained checkpoint may be improved by better exploring the pre-train basin; however, leaving the basin results in losing the benefits of transfer learning and in degradation of the ensemble quality. Based on the analysis of existing exploration methods, we propose a more effective modification of the Snapshot Ensembles (SSE) for the transfer learning setup, StarSSE, which results in stronger ensembles and uniform model soups.
    Solving Kernel Ridge Regression with Gradient Descent for a Non-Constant Kernel. (arXiv:2311.01762v1 [stat.ML])
    Kernel ridge regression (KRR) is a generalization of linear ridge regression that is non-linear in the data, but linear in the parameters. The solution can be obtained either as a closed-form solution, which includes a matrix inversion, or iteratively through gradient descent. Using the iterative approach makes it possible to change the kernel during training, which is what this paper investigates. We theoretically address the effects this has on model complexity and generalization. Based on our findings, we propose an update scheme for the bandwidth of translation-invariant kernels, where we let the bandwidth decrease to zero during training, thus circumventing the need for hyper-parameter selection. We demonstrate on real and synthetic data how decreasing the bandwidth during training outperforms using a constant bandwidth, selected by cross-validation and marginal likelihood maximization. We also show theoretically and empirically that using a decreasing bandwidth, we are able to achieve both zero training error in combination with good generalization, and a double descent behavior, phenomena that do not occur for KRR with constant bandwidth but are known to appear for neural networks.
    On the Convergence of Encoder-only Shallow Transformers. (arXiv:2311.01575v1 [cs.LG])
    In this paper, we aim to build the global convergence theory of encoder-only shallow Transformers under a realistic setting from the perspective of architectures, initialization, and scaling under a finite width regime. The difficulty lies in how to tackle the softmax in the self-attention mechanism, the core ingredient of the Transformer. In particular, we diagnose the scaling scheme, carefully tackle the input/output of softmax, and prove that quadratic overparameterization is sufficient for global convergence of our shallow Transformers under the He/LeCun initialization commonly used in practice. In addition, a neural tangent kernel (NTK) based analysis is given, which facilitates a comprehensive comparison. Our theory demonstrates the separation on the importance of different scaling schemes and initialization. We believe our results can pave the way for a better understanding of modern Transformers, particularly on training dynamics.
    Learning to Augment Distributions for Out-of-Distribution Detection. (arXiv:2311.01796v1 [cs.LG])
    Open-world classification systems should discern out-of-distribution (OOD) data whose labels deviate from those of in-distribution (ID) cases, motivating recent studies in OOD detection. Advanced works, despite their promising progress, may still fail in the open world, owing to the lack of knowledge about unseen OOD data in advance. Although one can access auxiliary OOD data (distinct from unseen ones) for model training, it remains to analyze how such auxiliary data will work in the open world. To this end, we delve into such a problem from a learning theory perspective, finding that the distribution discrepancy between the auxiliary and the unseen real OOD data is the key to affecting the open-world detection performance. Accordingly, we propose Distributional-Augmented OOD Learning (DAL), alleviating the OOD distribution discrepancy by crafting an OOD distribution set that contains all distributions in a Wasserstein ball centered on the auxiliary OOD distribution. We justify that the predictor trained over the worst OOD data in the ball can shrink the OOD distribution discrepancy, thus improving the open-world detection performance given only the auxiliary OOD data. We conduct extensive evaluations across representative OOD detection setups, demonstrating the superiority of our DAL over its advanced counterparts.
    Better Fair than Sorry: Adversarial Missing Data Imputation for Fair GNNs. (arXiv:2311.01591v1 [cs.LG])
    This paper addresses the problem of learning fair Graph Neural Networks (GNNs) under missing protected attributes. GNNs have achieved state-of-the-art results in many relevant tasks where decisions might disproportionately impact specific communities. However, existing work on fair GNNs assumes that either protected attributes are fully-observed or that the missing data imputation is fair. In practice, biases in the imputation will be propagated to the model outcomes, leading them to overestimate the fairness of their predictions. We address this challenge by proposing Better Fair than Sorry (BFtS), a fair missing data imputation model for protected attributes used by fair GNNs. The key design principle behind BFtS is that imputations should approximate the worst-case scenario for the fair GNN -- i.e. when optimizing fairness is the hardest. We implement this idea using a 3-player adversarial scheme where two adversaries collaborate against the fair GNN. Experiments using synthetic and real datasets show that BFtS often achieves a better fairness $\times$ accuracy trade-off than existing alternatives.
    Phase transitions in nonparametric regressions. (arXiv:2112.03626v7 [math.ST] UPDATED)
    When the unknown regression function of a single variable is known to have derivatives up to the $(\gamma+1)$th order bounded in absolute values by a common constant everywhere or a.e. (i.e., $(\gamma+1)$th degree of smoothness), the minimax optimal rate of the mean integrated squared error (MISE) is stated as $\left(\frac{1}{n}\right)^{\frac{2\gamma+2}{2\gamma+3}}$ in the literature. This paper shows that: (i) if $n\leq\left(\gamma+1\right)^{2\gamma+3}$, the minimax optimal MISE rate is $\frac{\log n}{n\log(\log n)}$ and the optimal degree of smoothness to exploit is roughly $\max\left\{ \left\lfloor \frac{\log n}{2\log\left(\log n\right)}\right\rfloor ,\,1\right\} $; (ii) if $n>\left(\gamma+1\right)^{2\gamma+3}$, the minimax optimal MISE rate is $\left(\frac{1}{n}\right)^{\frac{2\gamma+2}{2\gamma+3}}$ and the optimal degree of smoothness to exploit is $\gamma+1$. The fundamental contribution of this paper is a set of metric entropy bounds we develop for smooth function classes. Some of our bounds are original, and some of them improve and/or generalize the ones in the literature (e.g., Kolmogorov and Tikhomirov, 1959). Our metric entropy bounds allow us to show phase transitions in the minimax optimal MISE rates associated with some commonly seen smoothness classes as well as non-standard smoothness classes, and can also be of independent interest outside the nonparametric regression problems.
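    The two regimes stated in the abstract above can be written out as a small calculation. This is a sketch under stated assumptions: the function name is hypothetical, rates are reported without constants (the abstract gives them only up to order), and n is required to be at least 3 so that log(log n) is positive.

```python
import math


def minimax_mise_rate(n, gamma):
    """Return (rate, smoothness_to_exploit) for sample size n and index gamma.

    Regime (i):  n <= (gamma + 1)^(2*gamma + 3)
    Regime (ii): n >  (gamma + 1)^(2*gamma + 3)
    Rates are orders of magnitude only; constants are omitted.
    """
    if n < 3:
        raise ValueError("n must be at least 3 so that log(log n) > 0")
    threshold = (gamma + 1) ** (2 * gamma + 3)
    if n <= threshold:
        loglog = math.log(math.log(n))
        rate = math.log(n) / (n * loglog)
        smoothness = max(math.floor(math.log(n) / (2 * loglog)), 1)
    else:
        rate = (1.0 / n) ** ((2 * gamma + 2) / (2 * gamma + 3))
        smoothness = gamma + 1
    return rate, smoothness
```

    For example, with gamma = 1 the threshold is 2^5 = 32, so n = 1000 falls in regime (ii) with rate n^(-4/5), while gamma = 5 and n = 10^6 falls in regime (i) because 6^13 exceeds 10^6.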
    A Statistical Guarantee for Representation Transfer in Multitask Imitation Learning. (arXiv:2311.01589v1 [cs.LG])
    Transferring representation for multitask imitation learning has the potential to provide improved sample efficiency on learning new tasks, when compared to learning from scratch. In this work, we provide a statistical guarantee indicating that we can indeed achieve improved sample efficiency on the target task when a representation is trained using sufficiently diverse source tasks. Our theoretical results can be readily extended to account for commonly used neural network architectures with realistic assumptions. We conduct empirical analyses that align with our theoretical findings on four simulated environments; in particular, leveraging more data from source tasks can improve sample efficiency on learning in the new task.
    Efficient Generalized Low-Rank Tensor Contextual Bandits. (arXiv:2311.01771v1 [cs.LG])
    In this paper, we aim to build a novel bandit algorithm that is capable of fully harnessing the power of multi-dimensional data and the inherent non-linearity of reward functions to provide highly usable and accountable decision-making services. To this end, we introduce a generalized low-rank tensor contextual bandits model in which an action is formed from three feature vectors, and thus can be represented by a tensor. In this formulation, the reward is determined through a generalized linear function applied to the inner product of the action's feature tensor and a fixed but unknown parameter tensor with a low tubal rank. To effectively achieve the trade-off between exploration and exploitation, we introduce a novel algorithm called "Generalized Low-Rank Tensor Exploration Subspace then Refine" (G-LowTESTR). This algorithm first collects raw data to explore the intrinsic low-rank tensor subspace information embedded in the decision-making scenario, and then converts the original problem into an almost lower-dimensional generalized linear contextual bandits problem. Rigorous theoretical analysis shows that the regret bound of G-LowTESTR is superior to those in vectorization and matricization cases. We conduct a series of simulations and real data experiments to further highlight the effectiveness of G-LowTESTR, leveraging its ability to capitalize on the low-rank tensor structure for enhanced learning.
    Adversary ML Resilience in Autonomous Driving Through Human Centered Perception Mechanisms. (arXiv:2311.01478v1 [cs.CV])
    Physical adversarial attacks on road signs are continuously exploiting vulnerabilities in modern day autonomous vehicles (AVs) and impeding their ability to correctly classify what type of road sign they encounter. Current models cannot generalize input data well, resulting in overfitting or underfitting. In overfitting, the model memorizes the input data but cannot generalize to new scenarios. In underfitting, the model does not learn enough of the input data to accurately classify these road signs. This paper explores the resilience of autonomous driving systems against three main physical adversarial attacks (tape, graffiti, illumination), specifically targeting object classifiers. Several machine learning models were developed and evaluated on two distinct datasets: road signs (stop signs, speed limit signs, traffic lights, and pedestrian crosswalk signs) and geometric shapes (octagons, circles, squares, and triangles). The study compared algorithm performance under different conditions, including clean and adversarial training and testing on these datasets. To build robustness against attacks, defense techniques like adversarial training and transfer learning were implemented. Results demonstrated transfer learning models played a crucial role in performance by allowing knowledge gained from shape training to improve generalizability of road sign classification, despite the datasets being completely different. The paper suggests future research directions, including human-in-the-loop validation, security analysis, real-world testing, and explainable AI for transparency. This study aims to contribute to improving security and robustness of object classifiers in autonomous vehicles and mitigating adversarial example impacts on driving systems.
    Learning Sparse Codes with Entropy-Based ELBOs. (arXiv:2311.01888v1 [stat.ML])
    Standard probabilistic sparse coding assumes a Laplace prior, a linear mapping from latents to observables, and Gaussian observable distributions. We here derive a solely entropy-based learning objective for the parameters of standard sparse coding. The novel variational objective has the following features: (A) unlike MAP approximations, it uses non-trivial posterior approximations for probabilistic inference; (B) unlike for previous non-trivial approximations, the novel objective is fully analytical; and (C) the objective allows for a novel principled form of annealing. The objective is derived by first showing that the standard ELBO objective converges to a sum of entropies, which matches similar recent results for generative models with Gaussian priors. The conditions under which the ELBO becomes equal to entropies are then shown to have analytical solutions, which leads to the fully analytical objective. Numerical experiments are used to demonstrate the feasibility of learning with such entropy-based ELBOs. We investigate different posterior approximations including Gaussians with correlated latents and deep amortized approximations. Furthermore, we numerically investigate entropy-based annealing which results in improved learning. Our main contributions are theoretical, however, and they are twofold: (1) for non-trivial posterior approximations, we provide the (to the knowledge of the authors) first analytical ELBO objective for standard probabilistic sparse coding; and (2) we provide the first demonstration on how a recently shown convergence of the ELBO to entropy sums can be used for learning.
    Optimistic Multi-Agent Policy Gradient for Cooperative Tasks. (arXiv:2311.01953v1 [cs.LG])
    Relative overgeneralization (RO) occurs in cooperative multi-agent learning tasks when agents converge towards a suboptimal joint policy due to overfitting to suboptimal behavior of other agents. In early work, optimism has been shown to mitigate the RO problem when using tabular Q-learning. However, with function approximation optimism can amplify overestimation and thus fail on complex tasks. On the other hand, recent deep multi-agent policy gradient (MAPG) methods have succeeded in many complex tasks but may fail with severe RO. We propose a general, yet simple, framework to enable optimistic updates in MAPG methods and alleviate the RO problem. Specifically, we employ a Leaky ReLU function where a single hyperparameter selects the degree of optimism to reshape the advantages when updating the policy. Intuitively, our method remains optimistic toward individual actions with lower returns which are potentially caused by other agents' sub-optimal behavior during learning. The optimism prevents the individual agents from quickly converging to a local optimum. We also provide a formal analysis from an operator view to understand the proposed advantage transformation. In extensive evaluations on diverse sets of tasks, including illustrative matrix games, complex Multi-agent MuJoCo and Overcooked benchmarks, the proposed method outperforms strong baselines on 13 out of 19 tested tasks and matches the performance on the rest. Code can be found at https://github.com/wenshuaizhao/optimappo.
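    The advantage reshaping described in the abstract above can be sketched as follows. This assumes the single hyperparameter is the negative-side slope of the Leaky ReLU, here called `eta` (a hypothetical name), with `eta = 1` leaving advantages unchanged and `eta = 0` discarding negative advantages entirely; the exact parameterization used by the authors may differ.

```python
def optimistic_advantage(advantage, eta):
    """Apply a Leaky ReLU to an advantage estimate.

    Positive advantages pass through unchanged; negative advantages are
    scaled by eta in [0, 1], so low returns (possibly caused by other
    agents' sub-optimal behavior) are penalized less strongly.
    """
    if not 0.0 <= eta <= 1.0:
        raise ValueError("eta must lie in [0, 1]")
    return advantage if advantage >= 0.0 else eta * advantage


def reshape_advantages(advantages, eta):
    """Reshape a batch of advantages before a policy-gradient update."""
    return [optimistic_advantage(a, eta) for a in advantages]
```

    In a policy-gradient loss, the reshaped values would simply replace the raw advantages that weight the log-probability terms.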
    VQPy: An Object-Oriented Approach to Modern Video Analytics. (arXiv:2311.01623v1 [cs.CV])
    Video analytics is widely used in contemporary systems and services. At the forefront of video analytics are video queries that users develop to find objects of particular interest. Building upon the insight that video objects (e.g., humans, animals, cars), the center of video analytics, are similar in spirit to objects modeled by traditional object-oriented languages, we propose to develop an object-oriented approach to video analytics. This approach, named VQPy, consists of a frontend (a Python variant with constructs that make it easy for users to express video objects and their interactions) as well as an extensible backend that can automatically construct and optimize pipelines based on video objects. We have implemented and open-sourced VQPy, which has been productized in Cisco as part of its DeepVision framework.
    Maximum Likelihood Estimation of Flexible Survival Densities with Importance Sampling. (arXiv:2311.01660v1 [cs.LG])
    Survival analysis is a widely used technique for analyzing time-to-event data in the presence of censoring. In recent years, numerous survival analysis methods have emerged which scale to large datasets and relax traditional assumptions such as proportional hazards. These models, while being performant, are very sensitive to model hyperparameters, including (1) the number of bins and bin size for discrete models and (2) the number of cluster assignments for mixture-based models. Each of these choices requires extensive tuning by practitioners to achieve optimal performance. In addition, we demonstrate in empirical studies that: (1) optimal bin size may drastically differ based on the metric of interest (e.g., concordance vs. Brier score), and (2) mixture models may suffer from mode collapse and numerical instability. We propose a survival analysis approach which eliminates the need to tune hyperparameters such as mixture assignments and bin sizes, reducing the burden on practitioners. We show that the proposed approach matches or outperforms baselines on several real-world datasets.
    Should Under-parameterized Student Networks Copy or Average Teacher Weights?. (arXiv:2311.01644v1 [cs.LG])
    Any continuous function $f^*$ can be approximated arbitrarily well by a neural network with sufficiently many neurons $k$. We consider the case when $f^*$ itself is a neural network with one hidden layer and $k$ neurons. Approximating $f^*$ with a neural network with $n< k$ neurons can thus be seen as fitting an under-parameterized "student" network with $n$ neurons to a "teacher" network with $k$ neurons. As the student has fewer neurons than the teacher, it is unclear whether each of the $n$ student neurons should copy one of the teacher neurons or rather average a group of teacher neurons. For shallow neural networks with erf activation function and for the standard Gaussian input distribution, we prove that "copy-average" configurations are critical points if the teacher's incoming vectors are orthonormal and its outgoing weights are unitary. Moreover, the optimum among such configurations is reached when $n-1$ student neurons each copy one teacher neuron and the $n$-th student neuron averages the remaining $k-n+1$ teacher neurons. For the student network with $n=1$ neuron, we additionally provide a closed-form solution of the non-trivial critical point(s) for commonly used activation functions through solving an equivalent constrained optimization problem. Empirically, we find for the erf activation function that gradient flow converges either to the optimal copy-average critical point or to another point where each student neuron approximately copies a different teacher neuron. Finally, we find similar results for the ReLU activation function, suggesting that the optimal solution of underparameterized networks has a universal structure.
    Sequential Subset Matching for Dataset Distillation. (arXiv:2311.01570v1 [cs.CV])
    Dataset distillation is a newly emerging task that synthesizes a small-size dataset used in training deep neural networks (DNNs) for reducing data storage and model training costs. The synthetic datasets are expected to capture the essence of the knowledge contained in real-world datasets such that the former yields a similar performance as the latter. Recent advancements in distillation methods have produced notable improvements in generating synthetic datasets. However, current state-of-the-art methods treat the entire synthetic dataset as a unified entity and optimize each synthetic instance equally. This static optimization approach may lead to performance degradation in dataset distillation. Specifically, we argue that static optimization can give rise to a coupling issue within the synthetic data, particularly when a larger amount of synthetic data is being optimized. This coupling issue, in turn, leads to the failure of the distilled dataset to extract the high-level features learned by the DNN in the later epochs. In this study, we propose a new dataset distillation strategy called Sequential Subset Matching (SeqMatch), which tackles this problem by adaptively optimizing the synthetic data to encourage sequential acquisition of knowledge during dataset distillation. Our analysis indicates that SeqMatch effectively addresses the coupling issue by sequentially generating the synthetic instances, thereby enhancing its performance significantly. Our proposed SeqMatch outperforms state-of-the-art methods in various datasets, including SVHN, CIFAR-10, CIFAR-100, and Tiny ImageNet. Our code is available at https://github.com/shqii1j/seqmatch.
    Robust Adversarial Reinforcement Learning via Bounded Rationality Curricula. (arXiv:2311.01642v1 [cs.LG])
    Robustness against adversarial attacks and distribution shifts is a long-standing goal of Reinforcement Learning (RL). To this end, Robust Adversarial Reinforcement Learning (RARL) trains a protagonist against destabilizing forces exercised by an adversary in a competitive zero-sum Markov game, whose optimal solution, i.e., rational strategy, corresponds to a Nash equilibrium. However, finding Nash equilibria requires facing complex saddle point optimization problems, which can be prohibitive to solve, especially for high-dimensional control. In this paper, we propose a novel approach for adversarial RL based on entropy regularization to ease the complexity of the saddle point optimization problem. We show that the solution of this entropy-regularized problem corresponds to a Quantal Response Equilibrium (QRE), a generalization of Nash equilibria that accounts for bounded rationality, i.e., agents sometimes play random actions instead of optimal ones. Crucially, the connection between the entropy-regularized objective and QRE enables free modulation of the rationality of the agents by simply tuning the temperature coefficient. We leverage this insight to propose our novel algorithm, Quantal Adversarial RL (QARL), which gradually increases the rationality of the adversary in a curriculum fashion until it is fully rational, easing the complexity of the optimization problem while retaining robustness. We provide extensive evidence of QARL outperforming RARL and recent baselines across several MuJoCo locomotion and navigation problems in overall performance and robustness.
    Variable Selection in Maximum Mean Discrepancy for Interpretable Distribution Comparison. (arXiv:2311.01537v1 [stat.ML])
    Two-sample testing decides whether two datasets are generated from the same distribution. This paper studies variable selection for two-sample testing, the task being to identify the variables (or dimensions) responsible for the discrepancies between the two distributions. This task is relevant to many problems of pattern analysis and machine learning, such as dataset shift adaptation, causal inference and model validation. Our approach is based on a two-sample test based on the Maximum Mean Discrepancy (MMD). We optimise the Automatic Relevance Detection (ARD) weights defined for individual variables to maximise the power of the MMD-based test. For this optimisation, we introduce sparse regularisation and propose two methods for dealing with the issue of selecting an appropriate regularisation parameter. One method determines the regularisation parameter in a data-driven way, and the other aggregates the results of different regularisation parameters. We confirm the validity of the proposed methods by systematic comparisons with baseline methods, and demonstrate their usefulness in exploratory analysis of high-dimensional traffic simulation data. Preliminary theoretical analyses are also provided, including a rigorous definition of variable selection for two-sample testing.
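    The role of per-variable weights in an MMD statistic can be illustrated with a minimal sketch. It assumes a Gaussian kernel with one weight per variable and the standard unbiased estimator of the squared MMD; the function names, the kernel form, and the toy data are assumptions made for illustration. The sketch only evaluates the statistic for fixed weights and does not implement the paper's power optimisation or sparse regularisation.

```python
import numpy as np

def ard_gaussian_kernel(x, y, weights):
    """Gaussian kernel with one weight per variable (assumed ARD form)."""
    diff = x[:, None, :] - y[None, :, :]                 # shape (n, m, d)
    sq_dist = np.sum((weights ** 2) * diff ** 2, axis=-1)
    return np.exp(-sq_dist)

def mmd2_unbiased(x, y, weights):
    """Unbiased estimate of the squared MMD between samples x and y."""
    n, m = len(x), len(y)
    kxx = ard_gaussian_kernel(x, x, weights)
    kyy = ard_gaussian_kernel(y, y, weights)
    kxy = ard_gaussian_kernel(x, y, weights)
    # Drop the diagonal terms k(x_i, x_i) for the within-sample averages.
    term_xx = (kxx.sum() - np.trace(kxx)) / (n * (n - 1))
    term_yy = (kyy.sum() - np.trace(kyy)) / (m * (m - 1))
    return term_xx + term_yy - 2.0 * kxy.mean()

# Toy data (hypothetical): only variable 0 differs between the two samples.
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 3))
y = rng.normal(size=(200, 3))
y[:, 0] += 2.0

mmd_on_shifted = mmd2_unbiased(x, y, np.array([1.0, 0.0, 0.0]))
mmd_on_same = mmd2_unbiased(x, y, np.array([0.0, 1.0, 0.0]))
```

    Putting all the weight on the shifted variable gives a clearly larger statistic than putting it on an unshifted one, which is the signal a weight-optimisation procedure can exploit to select variables.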
    Score Models for Offline Goal-Conditioned Reinforcement Learning. (arXiv:2311.02013v1 [cs.LG])
    Offline Goal-Conditioned Reinforcement Learning (GCRL) is tasked with learning to achieve multiple goals in an environment purely from offline datasets using sparse reward functions. Offline GCRL is pivotal for developing generalist agents capable of leveraging pre-existing datasets to learn diverse and reusable skills without hand-engineering reward functions. However, contemporary approaches to GCRL based on supervised learning and contrastive learning are often suboptimal in the offline setting. An alternative perspective on GCRL optimizes for occupancy matching, but necessitates learning a discriminator, which subsequently serves as a pseudo-reward for downstream RL. Inaccuracies in the learned discriminator can cascade, negatively influencing the resulting policy. We present a novel approach to GCRL under a new lens of mixture-distribution matching, leading to our discriminator-free method: SMORe. The key insight is combining the occupancy matching perspective of GCRL with a convex dual formulation to derive a learning objective that can better leverage suboptimal offline data. SMORe learns scores or unnormalized densities representing the importance of taking an action at a state for reaching a particular goal. SMORe is principled and our extensive experiments on the fully offline GCRL benchmark composed of robot manipulation and locomotion tasks, including high-dimensional observations, show that SMORe can outperform state-of-the-art baselines by a significant margin.
    Deep Learning for blind spectral unmixing of LULC classes with MODIS multispectral time series and ancillary data. (arXiv:2310.07223v2 [cs.CV] UPDATED)
    Remotely sensed data are dominated by mixed Land Use and Land Cover (LULC) types. Spectral unmixing is a technique to extract information from mixed pixels into their constituent LULC types and corresponding abundance fractions. Traditionally, solving this task has relied on either classical methods that require prior knowledge of endmembers or machine learning methods that avoid explicit endmember calculation, also known as blind spectral unmixing (BSU). Most BSU studies based on Deep Learning (DL) focus on one time-step hyperspectral or multispectral data. To our knowledge, here we provide the first study on BSU of LULC classes using MODIS multispectral time series, in the presence of missing data, with end-to-end DL models. We further boost the performance of a Long Short-Term Memory (LSTM)-based model by incorporating geographic plus topographic (geo-topographic) and climatic ancillary information. Our experiments show that combining spectral-temporal input data together with geo-topographic and climatic information substantially improves the abundance estimation of LULC classes in mixed pixels. To carry out this study, we built a new labeled dataset of the region of Andalusia (Spain) with monthly multispectral time series of pixels for the year 2013 from MODIS at 460m resolution, for two hierarchical levels of LULC classes, named Andalusia MultiSpectral MultiTemporal Unmixing (Andalusia-MSMTU). This dataset provides, at the pixel level, a multispectral time series plus ancillary information annotated with the abundance of each LULC class inside each pixel. The dataset (https://zenodo.org/record/7752348##.ZBmkkezMLdo) and code (https://github.com/jrodriguezortega/MSMTU) are available to the public.
    Cost-aware Generalized $\alpha$-investing for Multiple Hypothesis Testing. (arXiv:2210.17514v3 [cs.LG] UPDATED)
    We consider the problem of sequential multiple hypothesis testing with nontrivial data collection costs. This problem appears, for example, when conducting biological experiments to identify differentially expressed genes of a disease process. This work builds on the generalized $\alpha$-investing framework which enables control of the false discovery rate in a sequential testing setting. We make a theoretical analysis of the long term asymptotic behavior of $\alpha$-wealth which motivates a consideration of sample size in the $\alpha$-investing decision rule. Posing the testing process as a game with nature, we construct a decision rule that optimizes the expected $\alpha$-wealth reward (ERO) and provides an optimal sample size for each test. Empirical results show that a cost-aware ERO decision rule correctly rejects more false null hypotheses than other methods for $n=1$ where $n$ is the sample size. When the sample size is not fixed, cost-aware ERO uses a prior on the null hypothesis to adaptively allocate the sample budget to each test. We extend cost-aware ERO investing to finite-horizon testing which enables the decision rule to allocate samples in a non-myopic manner. Finally, empirical tests on real data sets from biological experiments show that cost-aware ERO balances the allocation of samples to an individual test against the allocation of samples across multiple tests.
    Sketching for Convex and Nonconvex Regularized Least Squares with Sharp Guarantees. (arXiv:2311.01806v1 [math.OC])
    Randomized algorithms are important for solving large-scale optimization problems. In this paper, we propose a fast sketching algorithm for least squares problems regularized by convex or nonconvex regularization functions, Sketching for Regularized Optimization (SRO). Our SRO algorithm first generates a sketch of the original data matrix, then solves the sketched problem. Unlike existing randomized algorithms, our algorithm handles general Fréchet subdifferentiable regularization functions in a unified framework. We present a general theoretical result for the approximation error between the optimization results of the original problem and the sketched problem for regularized least squares problems which can be convex or nonconvex. For an arbitrary convex regularizer, a relative-error bound is proved for the approximation error. Importantly, minimax rates for sparse signal estimation by solving the sketched sparse convex or nonconvex learning problems are also obtained using our general theoretical result under mild conditions. To the best of our knowledge, our results are among the first to demonstrate minimax rates for convex or nonconvex sparse learning problems by sketching under a unified theoretical framework. We further propose an iterative sketching algorithm which reduces the approximation error exponentially by iteratively invoking the sketching algorithm. Experimental results demonstrate the effectiveness of the proposed SRO and Iterative SRO algorithms.
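    The "sketch, then solve" idea can be illustrated with a minimal sketch for the simplest convex case. It assumes a Gaussian sketching matrix applied to both the data matrix and the response, and a ridge (squared L2) regularizer so that both problems have closed-form solutions. These choices, the function names, and the toy data are assumptions made for illustration; the SRO algorithm in the paper may sketch differently and covers far more general regularizers.

```python
import numpy as np

def ridge(A, b, lam):
    """Exact minimiser of 0.5*||A x - b||^2 + 0.5*lam*||x||^2."""
    d = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ b)

def sketched_ridge(A, b, lam, sketch_rows, rng):
    """Same problem after replacing (A, b) by the sketch (S A, S b)."""
    n = A.shape[0]
    # Gaussian sketch, scaled so that E[S.T @ S] is the identity.
    S = rng.normal(size=(sketch_rows, n)) / np.sqrt(sketch_rows)
    return ridge(S @ A, S @ b, lam)

# Toy problem (hypothetical): 2000 observations, 10 variables.
rng = np.random.default_rng(0)
n, d = 2000, 10
A = rng.normal(size=(n, d))
x_true = np.ones(d)
b = A @ x_true + 0.1 * rng.normal(size=n)

x_full = ridge(A, b, lam=1.0)
x_sketch = sketched_ridge(A, b, lam=1.0, sketch_rows=400, rng=rng)
rel_err = np.linalg.norm(x_sketch - x_full) / np.linalg.norm(x_full)
```

    Here the sketched problem has 400 rows instead of 2000, and `rel_err` measures how far its solution is from the solution of the original problem.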
    On the Generalization Properties of Diffusion Models. (arXiv:2311.01797v1 [cs.LG])
    Diffusion models are a class of generative models that serve to establish a stochastic transport map between an empirically observed, yet unknown, target distribution and a known prior. Despite their remarkable success in real-world applications, a theoretical understanding of their generalization capabilities remains underdeveloped. This work embarks on a comprehensive theoretical exploration of the generalization attributes of diffusion models. We establish theoretical estimates of the generalization gap that evolves in tandem with the training dynamics of score-based diffusion models, suggesting a polynomially small generalization error ($O(n^{-2/5}+m^{-4/5})$) on both the sample size $n$ and the model capacity $m$, evading the curse of dimensionality (i.e., not exponentially large in the data dimension) when early-stopped. Furthermore, we extend our quantitative analysis to a data-dependent scenario, wherein target distributions are portrayed as a succession of densities with progressively increasing distances between modes. This precisely elucidates the adverse effect of "modes shift" in ground truths on the model generalization. Moreover, these estimates are not solely theoretical constructs but have also been confirmed through numerical simulations. Our findings contribute to the rigorous understanding of diffusion models' generalization properties and provide insights that may guide practical applications.
    Graph Neural Diffusion Networks for Semi-supervised Learning. (arXiv:2201.09698v2 [cs.LG] UPDATED)
    Graph Convolutional Networks (GCN) is a pioneering model for graph-based semi-supervised learning. However, GCN does not perform well on sparsely-labeled graphs. Its two-layer version cannot effectively propagate the label information to the whole graph structure (i.e., the under-smoothing problem) while its deep version over-smoothens and is hard to train (i.e., the over-smoothing problem). To solve these two issues, we propose a new graph neural network called GND-Nets (for Graph Neural Diffusion Networks) that exploits the local and global neighborhood information of a vertex in a single layer. Exploiting the shallow network mitigates the over-smoothing problem while exploiting the local and global neighborhood information mitigates the under-smoothing problem. The utilization of the local and global neighborhood information of a vertex is achieved by a new graph diffusion method called neural diffusions, which integrate neural networks into the conventional linear and nonlinear graph diffusions. The adoption of neural networks makes neural diffusions adaptable to different datasets. Extensive experiments on various sparsely-labeled graphs verify the effectiveness and efficiency of GND-Nets compared to state-of-the-art approaches.  ( 2 min )
    Object-Centric Slot Diffusion. (arXiv:2303.10834v5 [cs.CV] UPDATED)
    The recent success of transformer-based image generative models in object-centric learning highlights the importance of powerful image generators for handling complex scenes. However, despite the high expressiveness of diffusion models in image generation, their integration into object-centric learning remains largely unexplored in this domain. In this paper, we explore the feasibility and potential of integrating diffusion models into object-centric learning and investigate the pros and cons of this approach. We introduce Latent Slot Diffusion (LSD), a novel model that serves dual purposes: it is the first object-centric learning model to replace conventional slot decoders with a latent diffusion model conditioned on object slots, and it is also the first unsupervised compositional conditional diffusion model that operates without the need for supervised annotations like text. Through experiments on various object-centric tasks, including the first application of the FFHQ dataset in this field, we demonstrate that LSD significantly outperforms state-of-the-art transformer-based decoders, particularly in more complex scenes, and exhibits superior unsupervised compositional generation quality. In addition, we conduct a preliminary investigation into the integration of pre-trained diffusion models in LSD and demonstrate its effectiveness in real-world image segmentation and generation. Project page is available at https://latentslotdiffusion.github.io  ( 2 min )
    GRANDE: Gradient-Based Decision Tree Ensembles. (arXiv:2309.17130v2 [cs.LG] UPDATED)
    Despite the success of deep learning for text and image data, tree-based ensemble models are still state-of-the-art for machine learning with heterogeneous tabular data. However, there is a significant need for tabular-specific gradient-based methods due to their high flexibility. In this paper, we propose GRANDE, GRAdieNt-Based Decision Tree Ensembles, a novel approach for learning hard, axis-aligned decision tree ensembles using end-to-end gradient descent. GRANDE is based on a dense representation of tree ensembles, which allows backpropagation with a straight-through operator to jointly optimize all model parameters. Our method combines axis-aligned splits, which are a useful inductive bias for tabular data, with the flexibility of gradient-based optimization. Furthermore, we introduce an advanced instance-wise weighting that facilitates learning representations for both simple and complex relations within a single model. We conducted an extensive evaluation on a predefined benchmark with 19 classification datasets and demonstrate that our method outperforms existing gradient-boosting and deep learning frameworks on most datasets. The method is available at https://github.com/s-marton/GRANDE
    Modality Cycles with Masked Conditional Diffusion for Unsupervised Anomaly Segmentation in MRI. (arXiv:2308.16150v3 [eess.IV] UPDATED)
    Unsupervised anomaly segmentation aims to detect patterns that are distinct from any patterns processed during training, commonly called abnormal or out-of-distribution patterns, without providing any associated manual segmentations. Since anomalies during deployment can lead to model failure, detecting the anomaly can enhance the reliability of models, which is valuable in high-risk domains like medical imaging. This paper introduces Masked Modality Cycles with Conditional Diffusion (MMCCD), a method that enables segmentation of anomalies across diverse patterns in multimodal MRI. The method is based on two fundamental ideas. First, we propose the use of cyclic modality translation as a mechanism for enabling abnormality detection. Image-translation models learn tissue-specific modality mappings, which are characteristic of tissue physiology. Thus, these learned mappings fail to translate tissues or image patterns that have never been encountered during training, and the error enables their segmentation. Furthermore, we combine image translation with a masked conditional diffusion model, which attempts to `imagine' what tissue exists under a masked area, further exposing unknown patterns as the generative model fails to recreate them. We evaluate our method on a proxy task by training on healthy-looking slices of BraTS2021 multi-modality MRIs and testing on slices with tumors. We show that our method compares favorably to previous unsupervised approaches based on image reconstruction and denoising with autoencoders and diffusion models.  ( 3 min )
    Algorithm Selection for Deep Active Learning with Imbalanced Datasets. (arXiv:2302.07317v3 [cs.LG] UPDATED)
    Label efficiency has become an increasingly important objective in deep learning applications. Active learning aims to reduce the number of labeled examples needed to train deep networks, but the empirical performance of active learning algorithms can vary dramatically across datasets and applications. It is difficult to know in advance which active learning strategy will perform well or best in a given application. To address this, we propose the first adaptive algorithm selection strategy for deep active learning. For any unlabeled dataset, our (meta) algorithm TAILOR (Thompson ActIve Learning algORithm selection) iteratively and adaptively chooses among a set of candidate active learning algorithms. TAILOR uses novel reward functions aimed at gathering class-balanced examples. Extensive experiments in multi-class and multi-label applications demonstrate TAILOR's effectiveness in achieving accuracy comparable to or better than that of the best of the candidate algorithms. Our implementation of TAILOR is open-sourced at https://github.com/jifanz/TAILOR.
    Allocating Divisible Resources on Arms with Unknown and Random Rewards. (arXiv:2306.16578v2 [cs.LG] UPDATED)
    We consider a decision maker allocating one unit of renewable and divisible resource in each period on a number of arms. The arms have unknown and random rewards whose means are proportional to the allocated resource and whose variances are proportional to an order $b$ of the allocated resource. In particular, if the decision maker allocates resource $A_i$ to arm $i$ in a period, then the reward $Y_i$ is $Y_i(A_i)=A_i \mu_i+A_i^b \xi_{i}$, where $\mu_i$ is the unknown mean and the noise $\xi_{i}$ is independent and sub-Gaussian. When the order $b$ ranges from 0 to 1, the framework smoothly bridges the standard stochastic multi-armed bandit and online learning with full feedback. We design two algorithms that attain the optimal gap-dependent and gap-independent regret bounds for $b\in [0,1]$, and demonstrate a phase transition at $b=1/2$. The theoretical results hinge on a novel concentration inequality we have developed that bounds a linear combination of sub-Gaussian random variables whose weights are fractional, adapted to the filtration, and monotonic.
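    The reward model $Y_i(A_i)=A_i \mu_i+A_i^b \xi_{i}$ can be simulated directly. The minimal sketch below assumes standard Gaussian noise (one particular sub-Gaussian choice) and a fixed allocation; the function name, the allocation, and the means are hypothetical values chosen for illustration, and no allocation algorithm from the paper is implemented.

```python
import numpy as np

def sample_rewards(allocation, mu, b, n_periods, rng):
    """Draw rewards Y_i = A_i * mu_i + A_i**b * xi_i for each period and arm."""
    allocation = np.asarray(allocation, dtype=float)
    mu = np.asarray(mu, dtype=float)
    # Assumption: xi_i is standard Gaussian, independent across arms and periods.
    xi = rng.normal(size=(n_periods, len(mu)))
    return allocation * mu + allocation ** b * xi

rng = np.random.default_rng(0)
allocation = np.array([0.5, 0.3, 0.2])   # one unit of resource split over 3 arms
mu = np.array([1.0, 2.0, 3.0])           # unknown to the decision maker
rewards = sample_rewards(allocation, mu, b=0.5, n_periods=20000, rng=rng)

empirical_mean = rewards.mean(axis=0)    # close to allocation * mu
empirical_std = rewards.std(axis=0)      # close to allocation ** b
```

    With $b=0$ the noise level does not depend on the allocation, while with $b=1$ it shrinks in proportion to the allocated resource, which are the two ends of the range $b\in [0,1]$ considered in the abstract.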
    Disentangled Representation Learning with Transmitted Information Bottleneck. (arXiv:2311.01686v1 [cs.CV])
    Encoding only the task-related information from the raw data, i.e., disentangled representation learning, can greatly contribute to the robustness and generalizability of models. Although significant advances have been made by regularizing the information in representations with information theory, two major challenges remain: 1) the representation compression inevitably leads to a performance drop; 2) the disentanglement constraints on representations are complicated to optimize. To address these issues, we introduce Bayesian networks with transmitted information to formulate the interaction among input and representations during disentanglement. Building upon this framework, we propose DisTIB (Transmitted Information Bottleneck for Disentangled representation learning), a novel objective that navigates the balance between information compression and preservation. We employ variational inference to derive a tractable estimation for DisTIB. This estimation can be simply optimized via standard gradient descent with a reparameterization trick. Moreover, we theoretically prove that DisTIB can achieve optimal disentanglement, underscoring its superior efficacy. To solidify our claims, we conduct extensive experiments on various downstream tasks to demonstrate the appealing efficacy of DisTIB and validate our theoretical analyses.
    On some limitations of data-driven weather forecasting models. (arXiv:2309.08473v2 [stat.ML] UPDATED)
    As in many other areas of engineering and applied science, Machine Learning (ML) is having a profound impact in the domain of Weather and Climate Prediction. A very recent development in this area has been the emergence of fully data-driven ML prediction models which routinely claim superior performance to that of traditional physics-based models. In this work, we examine some aspects of the forecasts produced by an exemplar of the current generation of ML models, Pangu-Weather, with a focus on the fidelity and physical consistency of those forecasts and how these characteristics relate to perceived forecast performance. The main conclusion is that Pangu-Weather forecasts, and possibly those of similar ML models, do not have the fidelity and physical consistency of physics-based models and their advantage in accuracy on traditional deterministic metrics of forecast skill can be at least partly attributed to these peculiarities. Balancing forecast skill and physical consistency of ML-driven predictions will be an important consideration for future ML models. However, and similarly to other modern post-processing technologies, the current ML models appear to be already able to add value to standard NWP output for specific forecast applications and, combined with their extremely low computational cost during deployment, are set to provide an additional, useful source of forecast information.
    Fairness Improvement with Multiple Protected Attributes: How Far Are We?. (arXiv:2308.01923v2 [cs.LG] UPDATED)
    Existing research mostly improves the fairness of Machine Learning (ML) software regarding a single protected attribute at a time, but this is unrealistic given that many users have multiple protected attributes. This paper conducts an extensive study of fairness improvement regarding multiple protected attributes, covering 11 state-of-the-art fairness improvement methods. We analyze the effectiveness of these methods with different datasets, metrics, and ML models when considering multiple protected attributes. The results reveal that improving fairness for a single protected attribute can largely decrease fairness regarding unconsidered protected attributes. This decrease is observed in up to 88.3% of scenarios (57.5% on average). More surprisingly, we find little difference in accuracy loss when considering single and multiple protected attributes, indicating that accuracy can be maintained in the multiple-attribute paradigm. However, the effect on precision and recall when handling multiple protected attributes is about 5 times and 8 times that of a single attribute. This has important implications for future fairness research: reporting only accuracy as the ML performance metric, which is currently common in the literature, is inadequate.  ( 2 min )
    Tracr: Compiled Transformers as a Laboratory for Interpretability. (arXiv:2301.05062v5 [cs.LG] UPDATED)
    We show how to "compile" human-readable programs into standard decoder-only transformer models. Our compiler, Tracr, generates models with known structure. This structure can be used to design experiments. For example, we use it to study "superposition" in transformers that execute multi-step algorithms. Additionally, the known structure of Tracr-compiled models can serve as ground-truth for evaluating interpretability methods. Commonly, because the "programs" learned by transformers are unknown it is unclear whether an interpretation succeeded. We demonstrate our approach by implementing and examining programs including computing token frequencies, sorting, and parenthesis checking. We provide an open-source implementation of Tracr at https://github.com/google-deepmind/tracr.  ( 2 min )
    Bayesian learning of feature spaces for multitasks problems. (arXiv:2209.03028v2 [stat.ML] UPDATED)
    This paper introduces a novel approach for multi-task regression that connects Kernel Machines (KMs) and Extreme Learning Machines (ELMs) through the exploitation of the Random Fourier Features (RFFs) approximation of the RBF kernel. In this sense, one of the contributions of this paper shows that for the proposed models, the KM and the ELM formulations can be regarded as two sides of the same coin. These proposed models, termed RFF-BLR, stand on a Bayesian framework that simultaneously addresses two main design goals. On the one hand, it fits multitask regressors based on KMs endowed with RBF kernels. On the other hand, it enables the introduction of a common-across-tasks prior that promotes multioutput sparsity in the ELM view. This Bayesian approach facilitates the simultaneous consideration of both the KM and ELM perspectives enabling (i) the optimisation of the RBF kernel parameter $\gamma$ within a probabilistic framework, (ii) the optimisation of the model complexity, and (iii) an efficient transfer of knowledge across tasks. The experimental results show that this framework can lead to significant performance improvements compared to the state-of-the-art methods in multitask nonlinear regression.  ( 2 min )
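    The Random Fourier Features approximation of the RBF kernel mentioned above can be illustrated with a minimal sketch. It assumes the kernel is parameterised as $k(x,y)=\exp(-\gamma\|x-y\|^2)$ and uses the standard cosine feature map with Gaussian frequencies; the function names and toy data are assumptions made for illustration, and the Bayesian RFF-BLR model itself is not implemented here.

```python
import numpy as np

def rbf_kernel(X, Y, gamma):
    """Exact RBF kernel k(x, y) = exp(-gamma * ||x - y||^2)."""
    sq_dist = np.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    return np.exp(-gamma * sq_dist)

def rff_features(X, gamma, n_features, rng):
    """Random Fourier Features z(x) with E[z(x) . z(y)] = k(x, y)."""
    d = X.shape[1]
    # Frequencies drawn from N(0, 2*gamma*I), the spectral density of the kernel.
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, n_features))
    phase = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + phase)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
gamma = 0.5

Z = rff_features(X, gamma, n_features=10000, rng=rng)
max_abs_error = np.max(np.abs(Z @ Z.T - rbf_kernel(X, X, gamma)))
```

    A linear model on the features `Z` then behaves approximately like a kernel machine with the RBF kernel, which is the bridge between the kernel view and the random-feature (ELM-like) view that the abstract describes.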
    Communication-Efficient Federated Non-Linear Bandit Optimization. (arXiv:2311.01695v1 [cs.LG])
    Federated optimization studies the problem of collaborative function optimization among multiple clients (e.g., mobile devices or organizations) under the coordination of a central server. Since the data is collected separately by each client and always remains decentralized, federated optimization preserves data privacy and allows for large-scale computing, which makes it a promising decentralized machine learning paradigm. Though it is often deployed for tasks that are online in nature, e.g., next-word prediction on keyboard apps, most works formulate it as an offline problem. The few exceptions that consider federated bandit optimization are limited to very simplistic function classes, e.g., linear, generalized linear, or non-parametric function classes with bounded RKHS norm, which severely hinders its practical usage. In this paper, we propose a new algorithm, named Fed-GO-UCB, for federated bandit optimization with generic non-linear objective functions. Under some mild conditions, we rigorously prove that Fed-GO-UCB is able to achieve a sub-linear rate for both cumulative regret and communication cost. At the heart of our theoretical analysis are the distributed regression oracle and individual confidence set construction, which can be of independent interest. Empirical evaluations also demonstrate the effectiveness of the proposed algorithm.
    Fine-Tuning Language Models with Advantage-Induced Policy Alignment. (arXiv:2306.02231v3 [cs.CL] UPDATED)
    Reinforcement learning from human feedback (RLHF) has emerged as a reliable approach to aligning large language models (LLMs) to human preferences. Among the plethora of RLHF techniques, proximal policy optimization (PPO) is one of the most widely used methods. Despite its popularity, however, PPO may suffer from mode collapse, instability, and poor sample efficiency. We show that these issues can be alleviated by a novel algorithm that we refer to as Advantage-Induced Policy Alignment (APA), which leverages a squared error loss function based on the estimated advantages. We demonstrate empirically that APA consistently outperforms PPO in language tasks by a large margin, when a separate reward model is employed as the evaluator. In addition, compared with PPO, APA offers a more stable form of control over the deviation from the model's initial policy, ensuring that the model improves its performance without collapsing to deterministic output. In addition to empirical results, we also provide a theoretical justification supporting the design of our loss function.
  • Open

    Bayesian Quantile Regression with Subset Selection: A Posterior Summarization Perspective. (arXiv:2311.02043v1 [stat.ME])
    Quantile regression is a powerful tool for inferring how covariates affect specific percentiles of the response distribution. Existing methods either estimate conditional quantiles separately for each quantile of interest or estimate the entire conditional distribution using semi- or non-parametric models. The former often produce inadequate models for real data and do not share information across quantiles, while the latter are characterized by complex and constrained models that can be difficult to interpret and computationally inefficient. Further, neither approach is well-suited for quantile-specific subset selection. Instead, we pose the fundamental problems of linear quantile estimation, uncertainty quantification, and subset selection from a Bayesian decision analysis perspective. For any Bayesian regression model, we derive optimal and interpretable linear estimates and uncertainty quantification for each model-based conditional quantile. Our approach introduces a quantile-focused squared error loss, which enables efficient, closed-form computing and maintains a close relationship with Wasserstein-based density estimation. In an extensive simulation study, our methods demonstrate substantial gains in quantile estimation accuracy, variable selection, and inference over frequentist and Bayesian competitors. We apply these tools to identify the quantile-specific impacts of social and environmental stressors on educational outcomes for a large cohort of children in North Carolina.
    Multilayer hypergraph clustering using the aggregate similarity matrix. (arXiv:2301.11657v3 [math.ST] UPDATED)
    We consider the community recovery problem on a multilayer variant of the hypergraph stochastic block model (HSBM). Each layer is associated with an independent realization of a d-uniform HSBM on N vertices. Given the similarity matrix containing the aggregated number of hyperedges incident to each pair of vertices, the goal is to obtain a partition of the N vertices into disjoint communities. In this work, we investigate a semidefinite programming (SDP) approach and obtain information-theoretic conditions on the model parameters that guarantee exact recovery both in the assortative and the disassortative cases.
    Differentially Private Topological Data Analysis. (arXiv:2305.03609v2 [stat.ML] UPDATED)
    This paper is the first to attempt differentially private (DP) topological data analysis (TDA), producing near-optimal private persistence diagrams. We analyze the sensitivity of persistence diagrams in terms of the bottleneck distance, and we show that the commonly used Čech complex has sensitivity that does not decrease as the sample size $n$ increases. This makes it challenging for the persistence diagrams of Čech complexes to be privatized. As an alternative, we show that the persistence diagram obtained by the $L^1$-distance to measure (DTM) has sensitivity $O(1/n)$. Based on the sensitivity analysis, we propose using the exponential mechanism whose utility function is defined in terms of the bottleneck distance of the $L^1$-DTM persistence diagrams. We also derive upper and lower bounds of the accuracy of our privacy mechanism; the obtained bounds indicate that the privacy error of our mechanism is near-optimal. We demonstrate the performance of our privatized persistence diagrams through simulations as well as on a real dataset tracking human movement.
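    The exponential mechanism referred to above selects an output with probability proportional to $\exp(\varepsilon u / (2\Delta))$, where $u$ is the utility and $\Delta$ its sensitivity. The minimal sketch below shows this generic mechanism over a small finite candidate set. The utilities are made-up numbers standing in for negative bottleneck distances, and the finite candidate set and the sensitivity value $1/n$ are assumptions made for illustration (the abstract only states the order $O(1/n)$); this is not the paper's mechanism over persistence diagrams.

```python
import numpy as np

def exponential_mechanism_probs(utilities, epsilon, sensitivity):
    """Selection probabilities proportional to exp(eps * u / (2 * sensitivity))."""
    scores = epsilon * np.asarray(utilities, dtype=float) / (2.0 * sensitivity)
    scores -= scores.max()            # subtract the max for numerical stability
    weights = np.exp(scores)
    return weights / weights.sum()

# Hypothetical utilities for three candidate outputs (higher is better).
utilities = np.array([-0.30, -0.10, -0.05])
n = 100                               # sample size
probs = exponential_mechanism_probs(utilities, epsilon=1.0, sensitivity=1.0 / n)

rng = np.random.default_rng(0)
choice = rng.choice(len(utilities), p=probs)   # privately selected candidate
```

    A smaller sensitivity sharpens the distribution around the best candidates for the same privacy budget $\varepsilon$, which is why a sensitivity that decreases with $n$ is valuable in this setting.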
    Doubly Robust Self-Training. (arXiv:2306.00265v3 [cs.LG] UPDATED)
    Self-training is an important technique for solving semi-supervised learning problems. It leverages unlabeled data by generating pseudo-labels and combining them with a limited labeled dataset for training. The effectiveness of self-training heavily relies on the accuracy of these pseudo-labels. In this paper, we introduce doubly robust self-training, a novel semi-supervised algorithm that provably balances between two extremes. When the pseudo-labels are entirely incorrect, our method reduces to a training process solely using labeled data. Conversely, when the pseudo-labels are completely accurate, our method transforms into a training process utilizing all pseudo-labeled data and labeled data, thus increasing the effective sample size. Through empirical evaluations on both the ImageNet dataset for image classification and the nuScenes autonomous driving dataset for 3D object detection, we demonstrate the superiority of the doubly robust loss over the standard self-training baseline.
    Adaptive Algorithms for Relaxed Pareto Set Identification. (arXiv:2307.00424v2 [stat.ML] UPDATED)
    In this paper we revisit the fixed-confidence identification of the Pareto optimal set in a multi-objective multi-armed bandit model. As the sample complexity to identify the exact Pareto set can be very large, a relaxation allowing to output some additional near-optimal arms has been studied. In this work we also tackle alternative relaxations that allow instead to identify a relevant subset of the Pareto set. Notably, we propose a single sampling strategy, called Adaptive Pareto Exploration, that can be used in conjunction with different stopping rules to take into account different relaxations of the Pareto Set Identification problem. We analyze the sample complexity of these different combinations, quantifying in particular the reduction in sample complexity that occurs when one seeks to identify at most $k$ Pareto optimal arms. We showcase the good practical performance of Adaptive Pareto Exploration on a real-world scenario, in which we adaptively explore several vaccination strategies against Covid-19 in order to find the optimal ones when multiple immunogenicity criteria are taken into account.
    Long Sequence Hopfield Memory. (arXiv:2306.04532v2 [cs.NE] CROSS LISTED)
    Sequence memory is an essential attribute of natural and artificial intelligence that enables agents to encode, store, and retrieve complex sequences of stimuli and actions. Computational models of sequence memory have been proposed where recurrent Hopfield-like neural networks are trained with temporally asymmetric Hebbian rules. However, these networks suffer from limited sequence capacity (maximal length of the stored sequence) due to interference between the memories. Inspired by recent work on Dense Associative Memories, we expand the sequence capacity of these models by introducing a nonlinear interaction term, enhancing separation between the patterns. We derive novel scaling laws for sequence capacity with respect to network size, significantly outperforming existing scaling laws for models based on traditional Hopfield networks, and verify these theoretical results with numerical simulation. Moreover, we introduce a generalized pseudoinverse rule to recall sequences of highly correlated patterns. Finally, we extend this model to store sequences with variable timing between states' transitions and describe a biologically-plausible implementation, with connections to motor neuroscience.
    Recurrent Neural-Linear Posterior Sampling for Nonstationary Contextual Bandits. (arXiv:2007.04750v2 [cs.LG] UPDATED)
    An agent in a nonstationary contextual bandit problem should balance between exploration and the exploitation of (periodic or structured) patterns present in its previous experiences. Handcrafting an appropriate historical context is an attractive alternative to transform a nonstationary problem into a stationary problem that can be solved efficiently. However, even a carefully designed historical context may introduce spurious relationships or lack a convenient representation of crucial information. In order to address these issues, we propose an approach that learns to represent the relevant context for a decision based solely on the raw history of interactions between the agent and the environment. This approach relies on a combination of features extracted by recurrent neural networks with a contextual linear bandit algorithm based on posterior sampling. Our experiments on a diverse selection of contextual and noncontextual nonstationary problems show that our recurrent approach consistently outperforms its feedforward counterpart, which requires handcrafted historical contexts, while being more widely applicable than conventional nonstationary bandit algorithms. Although it is very difficult to provide theoretical performance guarantees for our new approach, we also prove a novel regret bound for linear posterior sampling with measurement error that may serve as a foundation for future theoretical work.
    Tracr: Compiled Transformers as a Laboratory for Interpretability. (arXiv:2301.05062v5 [cs.LG] UPDATED)
    We show how to "compile" human-readable programs into standard decoder-only transformer models. Our compiler, Tracr, generates models with known structure. This structure can be used to design experiments. For example, we use it to study "superposition" in transformers that execute multi-step algorithms. Additionally, the known structure of Tracr-compiled models can serve as ground-truth for evaluating interpretability methods. Commonly, because the "programs" learned by transformers are unknown it is unclear whether an interpretation succeeded. We demonstrate our approach by implementing and examining programs including computing token frequencies, sorting, and parenthesis checking. We provide an open-source implementation of Tracr at https://github.com/google-deepmind/tracr.
    Latent Diffusion Model for Conditional Reservoir Facies Generation. (arXiv:2311.01968v1 [physics.geo-ph])
    Creating accurate and geologically realistic reservoir facies based on limited measurements is crucial for field development and reservoir management, especially in the oil and gas sector. Traditional two-point geostatistics, while foundational, often struggle to capture complex geological patterns. Multi-point statistics offers more flexibility, but comes with its own challenges. With the rise of Generative Adversarial Networks (GANs) and their success in various fields, there has been a shift towards using them for facies generation. However, recent advances in the computer vision domain have shown the superiority of diffusion models over GANs. Motivated by this, a novel Latent Diffusion Model is proposed, which is specifically designed for conditional generation of reservoir facies. The proposed model produces high-fidelity facies realizations that rigorously preserve conditioning data. It significantly outperforms a GAN-based alternative.
    Convex and Non-convex Optimization Under Generalized Smoothness. (arXiv:2306.01264v2 [math.OC] UPDATED)
    Classical analysis of convex and non-convex optimization methods often requires the Lipschitzness of the gradient, which limits the analysis to functions bounded by quadratics. Recent work relaxed this requirement to a non-uniform smoothness condition with the Hessian norm bounded by an affine function of the gradient norm, and proved convergence in the non-convex setting via gradient clipping, assuming bounded noise. In this paper, we further generalize this non-uniform smoothness condition and develop a simple, yet powerful analysis technique that bounds the gradients along the trajectory, thereby leading to stronger results for both convex and non-convex optimization problems. In particular, we obtain the classical convergence rates for (stochastic) gradient descent and Nesterov's accelerated gradient method in the convex and/or non-convex setting under this general smoothness condition. The new analysis approach does not require gradient clipping and allows heavy-tailed noise with bounded variance in the stochastic setting.
    Learning nonparametric latent causal graphs with unknown interventions. (arXiv:2306.02899v2 [stat.ML] UPDATED)
    We establish conditions under which latent causal graphs are nonparametrically identifiable and can be reconstructed from unknown interventions in the latent space. Our primary focus is the identification of the latent structure in measurement models without parametric assumptions such as linearity or Gaussianity. Moreover, we do not assume the number of hidden variables is known, and we show that at most one unknown intervention per hidden variable is needed. This extends a recent line of work on learning causal representations from observations and interventions. The proofs are constructive and introduce two new graphical concepts -- imaginary subsets and isolated edges -- that may be useful in their own right. As a matter of independent interest, the proofs also involve a novel characterization of the limits of edge orientations within the equivalence class of DAGs induced by unknown interventions. These are the first results to characterize the conditions under which causal representations are identifiable without making any parametric assumptions in a general setting with unknown interventions and without faithfulness.
    Faithful and Robust Local Interpretability for Textual Predictions. (arXiv:2311.01605v1 [cs.CL])
    Interpretability is essential for machine learning models to be trusted and deployed in critical domains. However, existing methods for interpreting text models are often complex, lack solid mathematical foundations, and their performance is not guaranteed. In this paper, we propose FRED (Faithful and Robust Explainer for textual Documents), a novel method for interpreting predictions over text. FRED identifies key words in a document that significantly impact the prediction when removed. We establish the reliability of FRED through formal definitions and theoretical analyses on interpretable classifiers. Additionally, our empirical evaluation against state-of-the-art methods demonstrates the effectiveness of FRED in providing insights into text models.
    Proportional Response: Contextual Bandits for Simple and Cumulative Regret Minimization. (arXiv:2307.02108v3 [cs.LG] UPDATED)
    In many applications, e.g. in healthcare and e-commerce, the goal of a contextual bandit may be to learn an optimal treatment assignment policy at the end of the experiment. That is, to minimize simple regret. However, this objective remains understudied. We propose a new family of computationally efficient bandit algorithms for the stochastic contextual bandit setting, where a tuning parameter determines the weight placed on cumulative regret minimization (where we establish near-optimal minimax guarantees) versus simple regret minimization (where we establish state-of-the-art guarantees). Our algorithms work with any function class, are robust to model misspecification, and can be used in continuous arm settings. This flexibility comes from constructing and relying on "conformal arm sets" (CASs). CASs provide a set of arms for every context, encompassing the context-specific optimal arm with a certain probability across the context distribution. Our positive results on simple and cumulative regret guarantees are contrasted with a negative result, which shows that no algorithm can achieve instance-dependent simple regret guarantees while simultaneously achieving minimax optimal cumulative regret guarantees.
    Variable Selection in Maximum Mean Discrepancy for Interpretable Distribution Comparison. (arXiv:2311.01537v1 [stat.ML])
    Two-sample testing decides whether two datasets are generated from the same distribution. This paper studies variable selection for two-sample testing, the task being to identify the variables (or dimensions) responsible for the discrepancies between the two distributions. This task is relevant to many problems of pattern analysis and machine learning, such as dataset shift adaptation, causal inference and model validation. Our approach is based on a two-sample test based on the Maximum Mean Discrepancy (MMD). We optimise the Automatic Relevance Detection (ARD) weights defined for individual variables to maximise the power of the MMD-based test. For this optimisation, we introduce sparse regularisation and propose two methods for dealing with the issue of selecting an appropriate regularisation parameter. One method determines the regularisation parameter in a data-driven way, and the other aggregates the results of different regularisation parameters. We confirm the validity of the proposed methods by systematic comparisons with baseline methods, and demonstrate their usefulness in exploratory analysis of high-dimensional traffic simulation data. Preliminary theoretical analyses are also provided, including a rigorous definition of variable selection for two-sample testing.
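    As a rough illustration of the idea in the abstract above, the sketch below computes the standard unbiased MMD^2 estimate with a Gaussian kernel whose per-variable weights play the role of ARD weights. The kernel parameterization, the function names, and the toy data are assumptions made for illustration; the paper's actual objective (test power with sparse regularisation) is not reproduced here.

```python
import numpy as np

def ard_gaussian_kernel(A, B, weights):
    """k(a, b) = exp(-sum_d weights[d]^2 * (a_d - b_d)^2); weights are hypothetical ARD weights."""
    diff = (A[:, None, :] - B[None, :, :]) * weights
    return np.exp(-np.sum(diff ** 2, axis=-1))

def mmd2_unbiased(X, Y, weights):
    """Unbiased estimate of squared MMD between samples X and Y."""
    n, m = len(X), len(Y)
    Kxx = ard_gaussian_kernel(X, X, weights)
    Kyy = ard_gaussian_kernel(Y, Y, weights)
    Kxy = ard_gaussian_kernel(X, Y, weights)
    # Drop the diagonal terms k(x_i, x_i) to remove the bias.
    term_x = (Kxx.sum() - np.trace(Kxx)) / (n * (n - 1))
    term_y = (Kyy.sum() - np.trace(Kyy)) / (m * (m - 1))
    return term_x + term_y - 2.0 * Kxy.mean()

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
Y = rng.normal(size=(300, 2))
Y[:, 0] += 2.0  # toy data: the two distributions differ only in variable 0

# Putting all weight on one variable at a time shows which one carries the discrepancy.
mmd_var0 = mmd2_unbiased(X, Y, np.array([1.0, 0.0]))
mmd_var1 = mmd2_unbiased(X, Y, np.array([0.0, 1.0]))
print(mmd_var0, mmd_var1)
```

    In this toy setting the estimate is large when the weight sits on the shifted variable and close to zero otherwise, which is the signal a weight-optimisation procedure could exploit.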
    Transport, Variational Inference and Diffusions: with Applications to Annealed Flows and Schr\"odinger Bridges. (arXiv:2307.01050v3 [stat.ML] UPDATED)
    This paper explores the connections between optimal transport and variational inference, with a focus on forward and reverse time stochastic differential equations and Girsanov transformations. We present a principled and systematic framework for sampling and generative modelling centred around divergences on path space. Our work culminates in the development of a novel score-based annealed flow technique (with connections to Jarzynski and Crooks identities from statistical physics) and a regularised iterative proportional fitting (IPF)-type objective, departing from the sequential nature of standard IPF. Through a series of generative modelling examples and a double-well-based rare event task, we showcase the potential of the proposed methods.
    Minimax Quasi-Bayesian estimation in sparse canonical correlation analysis via a Rayleigh quotient function. (arXiv:2010.08627v3 [stat.ML] UPDATED)
    Canonical correlation analysis (CCA) is a popular statistical technique for exploring relationships between datasets. In recent years, the estimation of sparse canonical vectors has emerged as an important but challenging variant of the CCA problem, with widespread applications. Unfortunately, existing rate-optimal estimators for sparse canonical vectors have high computational cost. We propose a quasi-Bayesian estimation procedure that not only achieves the minimax estimation rate, but also is easy to compute by Markov Chain Monte Carlo (MCMC). The method builds on Tan et al. (2018) and uses a re-scaled Rayleigh quotient function as the quasi-log-likelihood. However, unlike Tan et al. (2018), we adopt a Bayesian framework that combines this quasi-log-likelihood with a spike-and-slab prior to regularize the inference and promote sparsity. We investigate the empirical behavior of the proposed method on both continuous and truncated data, and we demonstrate that it outperforms several state-of-the-art methods. As an application, we use the proposed methodology to maximally correlate clinical variables and proteomic data for better understanding the Covid-19 disease.
    Allocating Divisible Resources on Arms with Unknown and Random Rewards. (arXiv:2306.16578v2 [cs.LG] UPDATED)
    We consider a decision maker allocating one unit of renewable and divisible resource in each period on a number of arms. The arms have unknown and random rewards whose means are proportional to the allocated resource and whose variances are proportional to an order $b$ of the allocated resource. In particular, if the decision maker allocates resource $A_i$ to arm $i$ in a period, then the reward $Y_i$ is $Y_i(A_i)=A_i \mu_i+A_i^b \xi_{i}$, where $\mu_i$ is the unknown mean and the noise $\xi_{i}$ is independent and sub-Gaussian. When the order $b$ ranges from 0 to 1, the framework smoothly bridges the standard stochastic multi-armed bandit and online learning with full feedback. We design two algorithms that attain the optimal gap-dependent and gap-independent regret bounds for $b\in [0,1]$, and demonstrate a phase transition at $b=1/2$. The theoretical results hinge on a novel concentration inequality we have developed that bounds a linear combination of sub-Gaussian random variables whose weights are fractional, adapted to the filtration, and monotonic.
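    The reward model stated in the abstract above, $Y_i(A_i)=A_i \mu_i+A_i^b \xi_{i}$, can be simulated in a few lines. This is only a sketch of the observation model, not of the paper's algorithms; the function name, the Gaussian choice for the sub-Gaussian noise, and the example numbers are assumptions.

```python
import numpy as np

def simulate_rewards(allocation, mu, b, rng, noise_scale=1.0):
    """Draw Y_i = A_i * mu_i + A_i**b * xi_i for each arm i.

    Gaussian noise is used here as one example of sub-Gaussian noise (an assumption).
    """
    allocation = np.asarray(allocation, dtype=float)
    xi = rng.normal(scale=noise_scale, size=allocation.shape)
    return allocation * mu + allocation ** b * xi

rng = np.random.default_rng(1)
mu = np.array([0.9, 0.5, 0.2])          # hypothetical unknown means
allocation = np.array([0.6, 0.3, 0.1])  # one unit of resource split across arms

# Repeat the same allocation over many periods and average the observed rewards.
draws = simulate_rewards(np.tile(allocation, (20000, 1)), mu, b=0.5, rng=rng)
empirical_mean = draws.mean(axis=0)
print(empirical_mean, allocation * mu)
```

    The empirical means approach `allocation * mu`, while the spread of each arm's reward scales with `allocation ** b`, which is the quantity the order $b$ controls.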
    To Stay or Not to Stay in the Pre-train Basin: Insights on Ensembling in Transfer Learning. (arXiv:2303.03374v2 [cs.LG] UPDATED)
    Transfer learning and ensembling are two popular techniques for improving the performance and robustness of neural networks. Due to the high cost of pre-training, ensembles of models fine-tuned from a single pre-trained checkpoint are often used in practice. Such models end up in the same basin of the loss landscape, which we call the pre-train basin, and thus have limited diversity. In this work, we show that ensembles trained from a single pre-trained checkpoint may be improved by better exploring the pre-train basin; however, leaving the basin results in losing the benefits of transfer learning and in degradation of the ensemble quality. Based on the analysis of existing exploration methods, we propose a more effective modification of the Snapshot Ensembles (SSE) for the transfer learning setup, StarSSE, which results in stronger ensembles and uniform model soups.
    Provably Convergent Data-Driven Convex-Nonconvex Regularization. (arXiv:2310.05812v2 [cs.LG] UPDATED)
    An emerging new paradigm for solving inverse problems is via the use of deep learning to learn a regularizer from data. This leads to high-quality results, but often at the cost of provable guarantees. In this work, we show how well-posedness and convergent regularization arise within the convex-nonconvex (CNC) framework for inverse problems. We introduce a novel input weakly convex neural network (IWCNN) construction to adapt the method of learned adversarial regularization to the CNC framework. Empirically we show that our method overcomes numerical issues of previous adversarial methods.
    On some limitations of data-driven weather forecasting models. (arXiv:2309.08473v2 [stat.ML] UPDATED)
    As in many other areas of engineering and applied science, Machine Learning (ML) is having a profound impact in the domain of Weather and Climate Prediction. A very recent development in this area has been the emergence of fully data-driven ML prediction models which routinely claim superior performance to that of traditional physics-based models. In this work, we examine some aspects of the forecasts produced by an exemplar of the current generation of ML models, Pangu-Weather, with a focus on the fidelity and physical consistency of those forecasts and how these characteristics relate to perceived forecast performance. The main conclusion is that Pangu-Weather forecasts, and possibly those of similar ML models, do not have the fidelity and physical consistency of physics-based models and their advantage in accuracy on traditional deterministic metrics of forecast skill can be at least partly attributed to these peculiarities. Balancing forecast skill and physical consistency of ML-driven predictions will be an important consideration for future ML models. However, and similarly to other modern post-processing technologies, the current ML models appear to be already able to add value to standard NWP output for specific forecast applications and combined with their extremely low computational cost during deployment, are set to provide an additional, useful source of forecast information.
    Gradient Flows for Sampling: Mean-Field Models, Gaussian Approximations and Affine Invariance. (arXiv:2302.11024v6 [stat.ML] UPDATED)
    Sampling a probability distribution with an unknown normalization constant is a fundamental problem in computational science and engineering. This task may be cast as an optimization problem over all probability measures, and an initial distribution can be evolved to the desired minimizer dynamically via gradient flows. Mean-field models, whose law is governed by the gradient flow in the space of probability measures, may also be identified; particle approximations of these mean-field models form the basis of algorithms. The gradient flow approach is also the basis of algorithms for variational inference, in which the optimization is performed over a parameterized family of probability distributions such as Gaussians, and the underlying gradient flow is restricted to the parameterized family. By choosing different energy functionals and metrics for the gradient flow, different algorithms with different convergence properties arise. In this paper, we concentrate on the Kullback-Leibler divergence after showing that, up to scaling, it has the unique property that the gradient flows resulting from this choice of energy do not depend on the normalization constant. For the metrics, we focus on variants of the Fisher-Rao, Wasserstein, and Stein metrics; we introduce the affine invariance property for gradient flows, and their corresponding mean-field models, determine whether a given metric leads to affine invariance, and modify it to make it affine invariant if it does not. We study the resulting gradient flows in both probability density space and Gaussian space. The flow in the Gaussian space may be understood as a Gaussian approximation of the flow. We demonstrate that the Gaussian approximation based on the metric and through moment closure coincide, establish connections between them, and study their long-time convergence properties showing the advantages of affine invariance.
    Bayes beats Cross Validation: Efficient and Accurate Ridge Regression via Expectation Maximization. (arXiv:2310.18860v2 [stat.ML] UPDATED)
    We present a novel method for tuning the regularization hyper-parameter, $\lambda$, of a ridge regression that is faster to compute than leave-one-out cross-validation (LOOCV) while yielding estimates of the regression parameters of equal, or particularly in the setting of sparse covariates, superior quality to those obtained by minimising the LOOCV risk. The LOOCV risk can suffer from multiple and bad local minima for finite $n$ and thus requires the specification of a set of candidate $\lambda$, which can fail to provide good solutions. In contrast, we show that the proposed method is guaranteed to find a unique optimal solution for large enough $n$, under relatively mild conditions, without requiring the specification of any difficult to determine hyper-parameters. This is based on a Bayesian formulation of ridge regression that we prove to have a unimodal posterior for large enough $n$, allowing for both the optimal $\lambda$ and the regression coefficients to be jointly learned within an iterative expectation maximization (EM) procedure. Importantly, we show that by utilizing an appropriate preprocessing step, a single iteration of the main EM loop can be implemented in $O(\min(n, p))$ operations, for input data with $n$ rows and $p$ columns. In contrast, evaluating a single value of $\lambda$ using fast LOOCV costs $O(n \min(n, p))$ operations when using the same preprocessing. This advantage amounts to an asymptotic improvement of a factor of $l$ for $l$ candidate values for $\lambda$ (in the regime $q, p \in O(\sqrt{n})$ where $q$ is the number of regression targets).
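    For context on the LOOCV baseline mentioned in the abstract above, the sketch below evaluates the leave-one-out risk of ridge regression for a single $\lambda$ using the well-known hat-matrix shortcut and checks it against explicit refitting. It assumes no intercept, and it is a plain illustration rather than the paper's preprocessed fast LOOCV or its EM procedure; all names are hypothetical.

```python
import numpy as np

def ridge_loocv_risk(X, y, lam):
    """LOOCV mean squared error via the shortcut e_i / (1 - H_ii), no intercept."""
    n, p = X.shape
    G = X.T @ X + lam * np.eye(p)
    H = X @ np.linalg.solve(G, X.T)        # hat matrix of the ridge fit
    residuals = y - H @ y
    loo_residuals = residuals / (1.0 - np.diag(H))
    return np.mean(loo_residuals ** 2)

def ridge_loocv_risk_bruteforce(X, y, lam):
    """Same quantity computed by refitting n times, for verification only."""
    n, p = X.shape
    errors = []
    for i in range(n):
        keep = np.arange(n) != i
        beta = np.linalg.solve(X[keep].T @ X[keep] + lam * np.eye(p), X[keep].T @ y[keep])
        errors.append((y[i] - X[i] @ beta) ** 2)
    return np.mean(errors)

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 5))
y = X @ rng.normal(size=5) + 0.3 * rng.normal(size=50)

fast = ridge_loocv_risk(X, y, lam=1.0)
slow = ridge_loocv_risk_bruteforce(X, y, lam=1.0)
print(fast, slow)
```

    Selecting $\lambda$ by LOOCV means repeating this evaluation for every candidate value, which is the per-candidate cost the abstract contrasts with a single EM iteration.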
    Bayesian learning of feature spaces for multitasks problems. (arXiv:2209.03028v2 [stat.ML] UPDATED)
    This paper introduces a novel approach for multi-task regression that connects Kernel Machines (KMs) and Extreme Learning Machines (ELMs) through the exploitation of the Random Fourier Features (RFFs) approximation of the RBF kernel. In this sense, one of the contributions of this paper shows that for the proposed models, the KM and the ELM formulations can be regarded as two sides of the same coin. These proposed models, termed RFF-BLR, stand on a Bayesian framework that simultaneously addresses two main design goals. On the one hand, it fits multitask regressors based on KMs endowed with RBF kernels. On the other hand, it enables the introduction of a common-across-tasks prior that promotes multioutput sparsity in the ELM view. This Bayesian approach facilitates the simultaneous consideration of both the KM and ELM perspectives enabling (i) the optimisation of the RBF kernel parameter $\gamma$ within a probabilistic framework, (ii) the optimisation of the model complexity, and (iii) an efficient transfer of knowledge across tasks. The experimental results show that this framework can lead to significant performance improvements compared to the state-of-the-art methods in multitask nonlinear regression.
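    The Random Fourier Features approximation of the RBF kernel that the abstract above builds on can be sketched as follows. This shows only the standard approximation $k(x, y) = \exp(-\gamma \|x - y\|^2) \approx z(x)^\top z(y)$, not the RFF-BLR model itself; function names and sizes are assumptions.

```python
import numpy as np

def rbf_kernel(X, Y, gamma):
    """Exact RBF kernel exp(-gamma * ||x - y||^2)."""
    sq = np.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    return np.exp(-gamma * sq)

def random_fourier_features(X, gamma, n_features, rng):
    """Map X to features z(X) whose inner products approximate the RBF kernel."""
    d = X.shape[1]
    # Frequencies are drawn from N(0, 2 * gamma * I), the spectral density of this kernel.
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, n_features))
    phase = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + phase)

rng = np.random.default_rng(3)
X = rng.normal(size=(20, 3))
gamma = 0.5

K_exact = rbf_kernel(X, X, gamma)
Z = random_fourier_features(X, gamma, n_features=4000, rng=rng)
K_approx = Z @ Z.T
max_abs_error = np.max(np.abs(K_exact - K_approx))
print(max_abs_error)
```

    A linear model on `Z` then behaves approximately like a kernel machine with the RBF kernel, which is the sense in which the two views can be related.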
    Finite-Time Logarithmic Bayes Regret Upper Bounds. (arXiv:2306.09136v2 [cs.LG] UPDATED)
    We derive the first finite-time logarithmic Bayes regret upper bounds for Bayesian bandits. In Gaussian bandits, we obtain $O(c_\Delta \log n)$ and $O(c_h \log^2 n)$ bounds for an upper confidence bound algorithm, where $c_h$ and $c_\Delta$ are constants depending on the prior distribution and the gaps of random bandit instances sampled from it, respectively. The latter bound asymptotically matches the lower bound of Lai (1987). Our proofs are a major technical departure from prior works, while being simple and general. To show the generality of our techniques, we apply them to linear bandits. Our results provide insights on the value of prior in the Bayesian setting, both in the objective and as a side information given to the learner. They significantly improve upon existing $\tilde{O}(\sqrt{n})$ bounds, which have become standard in the literature despite the existing lower bounds.
    On the Generalization Properties of Diffusion Models. (arXiv:2311.01797v1 [cs.LG])
    Diffusion models are a class of generative models that serve to establish a stochastic transport map between an empirically observed, yet unknown, target distribution and a known prior. Despite their remarkable success in real-world applications, a theoretical understanding of their generalization capabilities remains underdeveloped. This work embarks on a comprehensive theoretical exploration of the generalization attributes of diffusion models. We establish theoretical estimates of the generalization gap that evolves in tandem with the training dynamics of score-based diffusion models, suggesting a polynomially small generalization error ($O(n^{-2/5}+m^{-4/5})$) on both the sample size $n$ and the model capacity $m$, evading the curse of dimensionality (i.e., not exponentially large in the data dimension) when early-stopped. Furthermore, we extend our quantitative analysis to a data-dependent scenario, wherein target distributions are portrayed as a succession of densities with progressively increasing distances between modes. This precisely elucidates the adverse effect of "modes shift" in ground truths on the model generalization. Moreover, these estimates are not solely theoretical constructs but have also been confirmed through numerical simulations. Our findings contribute to the rigorous understanding of diffusion models' generalization properties and provide insights that may guide practical applications.  ( 2 min )
    Solving Kernel Ridge Regression with Gradient Descent for a Non-Constant Kernel. (arXiv:2311.01762v1 [stat.ML])
    Kernel ridge regression, KRR, is a generalization of linear ridge regression that is non-linear in the data, but linear in the parameters. The solution can be obtained either as a closed-form solution, which includes a matrix inversion, or iteratively through gradient descent. Using the iterative approach opens up for changing the kernel during training, something that is investigated in this paper. We theoretically address the effects this has on model complexity and generalization. Based on our findings, we propose an update scheme for the bandwidth of translational-invariant kernels, where we let the bandwidth decrease to zero during training, thus circumventing the need for hyper-parameter selection. We demonstrate on real and synthetic data how decreasing the bandwidth during training outperforms using a constant bandwidth, selected by cross-validation and marginal likelihood maximization. We also show theoretically and empirically that using a decreasing bandwidth, we are able to achieve both zero training error in combination with good generalization, and a double descent behavior, phenomena that do not occur for KRR with constant bandwidth but are known to appear for neural networks.  ( 2 min )
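    The two solution routes named in the abstract above, closed form and iterative, can be compared on a toy problem with a constant kernel. The sketch below uses gradient descent in function-space (RKHS) geometry, where the update on the coefficients is $\alpha \leftarrow \alpha - \eta\,((K + \lambda I)\alpha - y)$; the fixed bandwidth, step size, and data are assumptions, and the paper's decreasing-bandwidth scheme is not implemented.

```python
import numpy as np

def gaussian_kernel(X, Y, bandwidth):
    sq = np.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * bandwidth ** 2))

def krr_closed_form(K, y, lam):
    """alpha = (K + lam * I)^{-1} y."""
    return np.linalg.solve(K + lam * np.eye(len(y)), y)

def krr_gradient_descent(K, y, lam, n_steps):
    """Iterate alpha <- alpha - eta * ((K + lam * I) alpha - y) from alpha = 0."""
    # Step size chosen from the largest eigenvalue so the iteration is stable.
    eta = 1.0 / (np.linalg.eigvalsh(K).max() + lam)
    alpha = np.zeros_like(y)
    for _ in range(n_steps):
        alpha = alpha - eta * (K @ alpha + lam * alpha - y)
    return alpha

rng = np.random.default_rng(4)
X = np.linspace(-3.0, 3.0, 40).reshape(-1, 1)
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=40)

K = gaussian_kernel(X, X, bandwidth=1.0)
alpha_exact = krr_closed_form(K, y, lam=1.0)
alpha_gd = krr_gradient_descent(K, y, lam=1.0, n_steps=2000)
print(np.max(np.abs(alpha_exact - alpha_gd)))
```

    Because the iterative version recomputes `K @ alpha` at every step, the kernel matrix could in principle be rebuilt with a different bandwidth between steps, which is the kind of flexibility the abstract refers to.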
    Reproducible Parameter Inference Using Bagged Posteriors. (arXiv:2311.02019v1 [stat.ME])
    Under model misspecification, it is known that Bayesian posteriors often do not properly quantify uncertainty about true or pseudo-true parameters. Even more fundamentally, misspecification leads to a lack of reproducibility in the sense that the same model will yield contradictory posteriors on independent data sets from the true distribution. To define a criterion for reproducible uncertainty quantification under misspecification, we consider the probability that two confidence sets constructed from independent data sets have nonempty overlap, and we establish a lower bound on this overlap probability that holds for any valid confidence sets. We prove that credible sets from the standard posterior can strongly violate this bound, particularly in high-dimensional settings (i.e., with dimension increasing with sample size), indicating that it is not internally coherent under misspecification. To improve reproducibility in an easy-to-use and widely applicable way, we propose to apply bagging to the Bayesian posterior ("BayesBag"); that is, to use the average of posterior distributions conditioned on bootstrapped datasets. We motivate BayesBag from first principles based on Jeffrey conditionalization and show that the bagged posterior typically satisfies the overlap lower bound. Further, we prove a Bernstein--Von Mises theorem for the bagged posterior, establishing its asymptotic normal distribution. We demonstrate the benefits of BayesBag via simulation experiments and an application to crime rate prediction.  ( 2 min )
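    The bagged posterior described in the abstract above, an average of posteriors conditioned on bootstrapped datasets, can be sketched for a conjugate normal-mean model. The model, the prior values, the bootstrap size (equal to the sample size), and the heavy-tailed toy data are all assumptions chosen so that the posterior is available in closed form.

```python
import numpy as np

def normal_posterior(data, prior_mean, prior_var, noise_var):
    """Posterior mean and variance for a normal mean with known noise variance."""
    n = len(data)
    post_var = 1.0 / (1.0 / prior_var + n / noise_var)
    post_mean = post_var * (prior_mean / prior_var + data.sum() / noise_var)
    return post_mean, post_var

def bagged_posterior_moments(data, n_boot, rng, prior_mean=0.0, prior_var=10.0, noise_var=1.0):
    """Mean and variance of the equal-weight mixture of bootstrap posteriors."""
    means, variances = [], []
    for _ in range(n_boot):
        resample = rng.choice(data, size=len(data), replace=True)
        m, v = normal_posterior(resample, prior_mean, prior_var, noise_var)
        means.append(m)
        variances.append(v)
    means, variances = np.array(means), np.array(variances)
    # Mixture variance = average within-component variance + variance of component means.
    return means.mean(), variances.mean() + means.var()

rng = np.random.default_rng(5)
data = rng.standard_t(df=3, size=200)  # heavy tails: the unit-variance normal model is misspecified

std_mean, std_var = normal_posterior(data, 0.0, 10.0, 1.0)
bag_mean, bag_var = bagged_posterior_moments(data, n_boot=500, rng=rng)
print(std_var, bag_var)
```

    In this toy case the bagged posterior keeps roughly the same center but is wider than the standard posterior, since it also reflects how much the posterior mean moves across resampled datasets.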
    Causal inference with Machine Learning-Based Covariate Representation. (arXiv:2311.01709v1 [stat.ME])
    Utilizing covariate information has been a powerful approach to improve the efficiency and accuracy of causal inference, which supports the massive number of randomized experiments run at data-driven enterprises. However, state-of-the-art approaches can become practically unreliable when the covariate dimension increases to just 50, whereas experiments on large platforms can observe covariates of even higher dimension. We propose a machine-learning-assisted covariate representation approach that can effectively make use of historical experimental or observational data from the same platform to understand which lower dimensions can effectively represent the higher-dimensional covariate. We then propose design and estimation methods with the covariate representation. We prove statistical reliability and performance guarantees for the proposed methods. The empirical performance is demonstrated using numerical experiments.  ( 2 min )
    Learning Sparse Codes with Entropy-Based ELBOs. (arXiv:2311.01888v1 [stat.ML])
    Standard probabilistic sparse coding assumes a Laplace prior, a linear mapping from latents to observables, and Gaussian observable distributions. We here derive a solely entropy-based learning objective for the parameters of standard sparse coding. The novel variational objective has the following features: (A) unlike MAP approximations, it uses non-trivial posterior approximations for probabilistic inference; (B) unlike for previous non-trivial approximations, the novel objective is fully analytical; and (C) the objective allows for a novel principled form of annealing. The objective is derived by first showing that the standard ELBO objective converges to a sum of entropies, which matches similar recent results for generative models with Gaussian priors. The conditions under which the ELBO becomes equal to entropies are then shown to have analytical solutions, which leads to the fully analytical objective. Numerical experiments are used to demonstrate the feasibility of learning with such entropy-based ELBOs. We investigate different posterior approximations including Gaussians with correlated latents and deep amortized approximations. Furthermore, we numerically investigate entropy-based annealing which results in improved learning. Our main contributions are theoretical, however, and they are twofold: (1) for non-trivial posterior approximations, we provide the (to the knowledge of the authors) first analytical ELBO objective for standard probabilistic sparse coding; and (2) we provide the first demonstration on how a recently shown convergence of the ELBO to entropy sums can be used for learning.  ( 2 min )
    Online non-parametric likelihood-ratio estimation by Pearson-divergence functional minimization. (arXiv:2311.01900v1 [stat.ML])
    Quantifying the difference between two probability density functions, $p$ and $q$, using available data, is a fundamental problem in Statistics and Machine Learning. A usual approach for addressing this problem is the likelihood-ratio estimation (LRE) between $p$ and $q$, which -- to our best knowledge -- has been investigated mainly for the offline case. This paper contributes by introducing a new framework for online non-parametric LRE (OLRE) for the setting where pairs of iid observations $(x_t \sim p, x'_t \sim q)$ are observed over time. The non-parametric nature of our approach has the advantage of being agnostic to the forms of $p$ and $q$. Moreover, we capitalize on the recent advances in Kernel Methods and functional minimization to develop an estimator that can be efficiently updated online. We provide theoretical guarantees for the performance of the OLRE method along with empirical validation in synthetic experiments.  ( 2 min )
    High Probability Convergence of Adam Under Unbounded Gradients and Affine Variance Noise. (arXiv:2311.02000v1 [math.OC])
    In this paper, we study the convergence of the Adaptive Moment Estimation (Adam) algorithm under unconstrained non-convex smooth stochastic optimization. Despite its widespread usage in machine learning, the theoretical understanding of Adam remains limited. Prior research has primarily investigated Adam's convergence in expectation, often requiring strong assumptions such as uniformly stochastic bounded gradients or a priori problem-dependent knowledge. As a result, the applicability of these findings in practical real-world scenarios has been constrained. To overcome these limitations, we provide a deep analysis and show that Adam can converge to a stationary point with high probability at a rate of $\mathcal{O}\left({\rm poly}(\log T)/\sqrt{T}\right)$ under coordinate-wise "affine" variance noise, without requiring any bounded gradient assumption or any a priori problem-dependent knowledge to tune hyper-parameters. Additionally, it is revealed that Adam confines its gradients' magnitudes within an order of $\mathcal{O}\left({\rm poly}(\log T)\right)$. Finally, we also investigate a simplified version of Adam without one of the corrective terms and obtain a convergence rate that is adaptive to the noise level.  ( 2 min )
    Local Bayesian Dirichlet mixing of imperfect models. (arXiv:2311.01596v1 [stat.ME])
    To improve the predictability of complex computational models in the experimentally-unknown domains, we propose a Bayesian statistical machine learning framework utilizing the Dirichlet distribution that combines results of several imperfect models. This framework can be viewed as an extension of Bayesian stacking. To illustrate the method, we study the ability of Bayesian model averaging and mixing techniques to mine nuclear masses. We show that the global and local mixtures of models reach excellent performance on both prediction accuracy and uncertainty quantification and are preferable to classical Bayesian model averaging. Additionally, our statistical analysis indicates that improving model predictions through mixing rather than mixing of corrected models leads to more robust extrapolations.  ( 2 min )
    Efficient Generalized Low-Rank Tensor Contextual Bandits. (arXiv:2311.01771v1 [cs.LG])
    In this paper, we aim to build a novel bandits algorithm that is capable of fully harnessing the power of multi-dimensional data and the inherent non-linearity of reward functions to provide high-usable and accountable decision-making services. To this end, we introduce a generalized low-rank tensor contextual bandits model in which an action is formed from three feature vectors, and thus can be represented by a tensor. In this formulation, the reward is determined through a generalized linear function applied to the inner product of the action's feature tensor and a fixed but unknown parameter tensor with a low tubal rank. To effectively achieve the trade-off between exploration and exploitation, we introduce a novel algorithm called "Generalized Low-Rank Tensor Exploration Subspace then Refine" (G-LowTESTR). This algorithm first collects raw data to explore the intrinsic low-rank tensor subspace information embedded in the decision-making scenario, and then converts the original problem into an almost lower-dimensional generalized linear contextual bandits problem. Rigorous theoretical analysis shows that the regret bound of G-LowTESTR is superior to those in vectorization and matricization cases. We conduct a series of simulations and real data experiments to further highlight the effectiveness of G-LowTESTR, leveraging its ability to capitalize on the low-rank tensor structure for enhanced learning.  ( 2 min )
    Invariant Causal Imitation Learning for Generalizable Policies. (arXiv:2311.01489v1 [stat.ML])
    Consider learning an imitation policy on the basis of demonstrated behavior from multiple environments, with an eye towards deployment in an unseen environment. Since the observable features from each setting may be different, directly learning individual policies as mappings from features to actions is prone to spurious correlations -- and may not generalize well. However, the expert's policy is often a function of a shared latent structure underlying those observable features that is invariant across settings. By leveraging data from multiple environments, we propose Invariant Causal Imitation Learning (ICIL), a novel technique in which we learn a feature representation that is invariant across domains, on the basis of which we learn an imitation policy that matches expert behavior. To cope with transition dynamics mismatch, ICIL learns a shared representation of causal features (for all training environments), that is disentangled from the specific representations of noise variables (for each of those environments). Moreover, to ensure that the learned policy matches the observation distribution of the expert's policy, ICIL estimates the energy of the expert's observations and uses a regularization term that minimizes the imitator policy's next state energy. Experimentally, we compare our methods against several benchmarks in control and healthcare tasks and show its effectiveness in learning imitation policies capable of generalizing to unseen environments.  ( 2 min )
    Obtaining Explainable Classification Models using Distributionally Robust Optimization. (arXiv:2311.01994v1 [stat.ML])
    Model explainability is crucial for human users to be able to interpret how a proposed classifier assigns labels to data based on its feature values. We study generalized linear models constructed using sets of feature value rules, which can capture nonlinear dependencies and interactions. An inherent trade-off exists between rule set sparsity and its prediction accuracy. It is computationally expensive to find the right choice of sparsity -- e.g., via cross-validation -- with existing methods. We propose a new formulation to learn an ensemble of rule sets that simultaneously addresses these competing factors. Good generalization is ensured while keeping computational costs low by utilizing distributionally robust optimization. The formulation utilizes column generation to efficiently search the space of rule sets and constructs a sparse ensemble of rule sets, in contrast with techniques like random forests or boosting and their variants. We present theoretical results that motivate and justify the use of our distributionally robust formulation. Extensive numerical experiments establish that our method improves over competing methods -- on a large set of publicly available binary classification problem instances -- with respect to one or more of the following metrics: generalization quality, computational cost, and explainability.  ( 2 min )
    Sketching for Convex and Nonconvex Regularized Least Squares with Sharp Guarantees. (arXiv:2311.01806v1 [math.OC])
    Randomized algorithms are important for solving large-scale optimization problems. In this paper, we propose a fast sketching algorithm for least squares problems regularized by convex or nonconvex regularization functions, Sketching for Regularized Optimization (SRO). Our SRO algorithm first generates a sketch of the original data matrix, then solves the sketched problem. Unlike existing randomized algorithms, our algorithm handles general Frechet subdifferentiable regularization functions in a unified framework. We present a general theoretical result for the approximation error between the optimization results of the original problem and the sketched problem for regularized least squares problems, which can be convex or nonconvex. For an arbitrary convex regularizer, a relative-error bound is proved for the approximation error. Importantly, minimax rates for sparse signal estimation by solving the sketched sparse convex or nonconvex learning problems are also obtained using our general theoretical result under mild conditions. To the best of our knowledge, our results are among the first to demonstrate minimax rates for convex or nonconvex sparse learning problems by sketching under a unified theoretical framework. We further propose an iterative sketching algorithm which reduces the approximation error exponentially by iteratively invoking the sketching algorithm. Experimental results demonstrate the effectiveness of the proposed SRO and Iterative SRO algorithms.  ( 2 min )
    Should Under-parameterized Student Networks Copy or Average Teacher Weights?. (arXiv:2311.01644v1 [cs.LG])
    Any continuous function $f^*$ can be approximated arbitrarily well by a neural network with sufficiently many neurons $k$. We consider the case when $f^*$ itself is a neural network with one hidden layer and $k$ neurons. Approximating $f^*$ with a neural network with $n< k$ neurons can thus be seen as fitting an under-parameterized "student" network with $n$ neurons to a "teacher" network with $k$ neurons. As the student has fewer neurons than the teacher, it is unclear, whether each of the $n$ student neurons should copy one of the teacher neurons or rather average a group of teacher neurons. For shallow neural networks with erf activation function and for the standard Gaussian input distribution, we prove that "copy-average" configurations are critical points if the teacher's incoming vectors are orthonormal and its outgoing weights are unitary. Moreover, the optimum among such configurations is reached when $n-1$ student neurons each copy one teacher neuron and the $n$-th student neuron averages the remaining $k-n+1$ teacher neurons. For the student network with $n=1$ neuron, we provide additionally a closed-form solution of the non-trivial critical point(s) for commonly used activation functions through solving an equivalent constrained optimization problem. Empirically, we find for the erf activation function that gradient flow converges either to the optimal copy-average critical point or to another point where each student neuron approximately copies a different teacher neuron. Finally, we find similar results for the ReLU activation function, suggesting that the optimal solution of underparameterized networks has a universal structure.  ( 3 min )
    Universal Sharpness Dynamics in Neural Network Training: Fixed Point Analysis, Edge of Stability, and Route to Chaos. (arXiv:2311.02076v1 [cs.LG])
    In gradient descent dynamics of neural networks, the top eigenvalue of the Hessian of the loss (sharpness) displays a variety of robust phenomena throughout training. This includes early time regimes where the sharpness may decrease during early periods of training (sharpness reduction), and later time behavior such as progressive sharpening and edge of stability. We demonstrate that a simple $2$-layer linear network (UV model) trained on a single training example exhibits all of the essential sharpness phenomenology observed in real-world scenarios. By analyzing the structure of dynamical fixed points in function space and the vector field of function updates, we uncover the underlying mechanisms behind these sharpness trends. Our analysis reveals (i) the mechanism behind early sharpness reduction and progressive sharpening, (ii) the required conditions for edge of stability, and (iii) a period-doubling route to chaos on the edge of stability manifold as learning rate is increased. Finally, we demonstrate that various predictions from this simplified model generalize to real-world scenarios and discuss its limitations.  ( 2 min )
    Maximum Likelihood Estimation of Flexible Survival Densities with Importance Sampling. (arXiv:2311.01660v1 [cs.LG])
    Survival analysis is a widely-used technique for analyzing time-to-event data in the presence of censoring. In recent years, numerous survival analysis methods have emerged which scale to large datasets and relax traditional assumptions such as proportional hazards. These models, while being performant, are very sensitive to model hyperparameters including: (1) number of bins and bin size for discrete models and (2) number of cluster assignments for mixture-based models. Each of these choices requires extensive tuning by practitioners to achieve optimal performance. In addition, we demonstrate in empirical studies that: (1) optimal bin size may drastically differ based on the metric of interest (e.g., concordance vs brier score), and (2) mixture models may suffer from mode collapse and numerical instability. We propose a survival analysis approach which eliminates the need to tune hyperparameters such as mixture assignments and bin sizes, reducing the burden on practitioners. We show that the proposed approach matches or outperforms baselines on several real-world datasets.  ( 2 min )
    Applications of the Theory of Aggregated Markov Processes in Stochastic Learning Theory. (arXiv:2311.01476v1 [stat.ML])
    A stochastic process that arises by composing a function with a Markov process is called an aggregated Markov process (AMP). The purpose of composing a Markov process with a function can be a reduction of dimensions, e.g., a projection onto certain coordinates. The theory around AMP has been extensively studied e.g. by Dynkin, Cameron, Rogers and Pitman, and Kelly, all of whom provided sufficient conditions for an AMP to remain Markov. In another direction, Larget provided a canonical representation for AMP, which can be used to verify the equivalence of two AMPs. The purpose of this paper is to describe how the theory of AMP can be applied to stochastic learning theory as they learn a particular task.  ( 2 min )

  • Open

    [D] if your company is ingesting work emails and chats for AI/ML pipelines, is there concern around sensitive business info getting out?
    Hi folks. Firstly, full disclosure: I'm the CEO of DataFog (www.datafog.ai). This is NOT a sales pitch but rather an interest in hearing what the community thinks about the overall issue, which I believe will ultimately be solved via an ML-based implementation. My contention is: - Generative AI has catalyzed the widespread practice of ingesting email and work chat content to power AI training and inference - this introduces a risk of content concerning confidential corporate affairs* that can pass most privacy filters. This results in raw data alluding to sensitive business events flowing in freely, open to easy accidental unauthorized access by an internal user, such as one in MLOps. My second contention is that the current security tools may not offer adequate coverage for what will be an evolving ongoin…
    [R] AI Alignment: A Comprehensive Survey
    https://arxiv.org/abs/2310.19852 ​ AI alignment aims to make AI systems behave in line with human intentions and values. As AI systems grow more capable, the potential large-scale risks associated with misaligned AI systems become salient. Hundreds of AI experts and public figures have expressed concerns about AI risks, arguing that "mitigating the risk of extinction from AI should be a global priority, alongside other societal-scale risks such as pandemics and nuclear war". To provide a comprehensive and up-to-date overview of the alignment field, in this survey paper, we delve into the core concepts, methodology, and practice of alignment. We identify the RICE principles as the key objectives of AI alignment: Robustness, Interpretability, Controllability, and Ethicality. Guided by these four principles, we outline the landscape of current alignment research and decompose them into two key components: forward alignment and backward alignment. The former aims to make AI systems aligned via alignment training, while the latter aims to gain evidence about the systems' alignment and govern them appropriately to avoid exacerbating misalignment risks. Forward alignment and backward alignment form a recurrent process where the alignment of AI systems from the forward process is verified in the backward process, meanwhile providing updated objectives for forward alignment in the next round. On forward alignment, we discuss learning from feedback and learning under distribution shift. On backward alignment, we discuss assurance techniques and governance practices that apply to every stage of AI systems' lifecycle. submitted by /u/mcaleste [link] [comments]
    [D] What are the top 3 best application types/implementation types/system types of each of Java, Python, and C# and/or .NET with respect to current and likely future AI/ML developments?
    I’d like to know for all three languages. Thank you. submitted by /u/hdtv2001 [link] [comments]
    [R] Interpreting CLIP's Image Representation via Text-Based Decomposition
    Blog: https://yossigandelsman.github.io/clip_decomposition/index.html Paper: https://arxiv.org/abs/2310.05916 This paper investigates the CLIP image encoder by analyzing how individual model components affect the final representation. The authors show that the CLIP image representation can be decomposed as a sum across individual image patches, model layers, and attention heads. They use CLIP's text representation to interpret these summands. Interpreting the attention heads, they characterize each head's role by automatically finding text representations that span its output space, which reveals property-specific roles for many heads (e.g. location, counting, texture, shape, and OCR). Interpreting the image patches, they uncover an emergent spatial localization within CLIP. Finally, they use this understanding to remove spurious features from CLIP and to create a strong zero-shot image segmenter. submitted by /u/Ordinary_Pollution70 [link] [comments]
    [D] How can a self-learning system, autonomously and without any user feedback, discover positive net value prompts that when incorporated could improve its overall performance?
    Take something like chain-of-thought or "let's think step by step". There's a lot of anecdotal and research evidence that with this suffix added, GPT-4 provides better reasoning (that's one example out of many). I'd like to believe that companies like OpenAI, Google, Anthropic, MSFT, et al. are making good use of the troves of user data they collect. I feel like it's relatively easy to find "negative" prompts because the desired outcome is known -- for example, I can find prompts that are jailbreaking the LLM just by feeding every input/output pair to another dedicated agent (one that could run at a much lower parameter count, and for a fraction of the cost of the main model) that evaluates whether there has been a "mold breakout". The same can also be said for finding…
    [Project] Which kind of machine learning should be used for parallel pumping energy efficiency optimization?
    Hello everyone I'm about to start writing my bachelor in Mechatronics and for this project I want to work with some kind of machine learning. The system I work with is a parallel pumping system for a cooling system. I have 2 years of data sampled every 10 minutes available of the power, pressure, flow and other necessary variables. I've seen some other papers implementing neural networks. Is that the only solution to this kind of problem? I'm not that knowledgeable about machine learning, so I was hoping some of you bright minds might help enlighten me in which direction I should go :) Thank you for taking the time to read! submitted by /u/vision_dev [link] [comments]
    [Project] Looking for D-ID alternatives for my specific use case
    What is my best bet for creating an AI avatar assistant of, let's say, myself, where I am already using GPT-4 and ElevenLabs for TTS? I considered using D-ID to generate videos for each response to the user, or using the streaming method, but D-ID's costs are out of this world. I'd prefer going the open-source route and using cloud tech to generate these AI-generated clips/videos. In this case I don't need to use Stable Diffusion for creating avatars, as my avatars will simply be a model of myself. What is the best open-source AI animation generator for the specific purpose of working from images, one that can generate videos/clips fast with the right equipment? submitted by /u/Izzy-gang- [link] [comments]
    [D] Best Domain Specific Embeddings?
    Been building a bunch of RAG apps recently and was wondering what the best domain specific models were. OpenAI’s Ada embedding model is not great for field specific texts and encoders like bioBERT or sentence transformers on hugging face don’t quite achieve the level of performance I’m looking for. Was wondering if there were any better options people have found. submitted by /u/Primary-Track8298 [link] [comments]
    [D] Understanding the diffusion process in denoising diffusion models
    I'm reading the DDPM paper and I don't understand the diffusion process definition; question is in the caption to the below image: Why is the highlighted part there? Shouldn't the mean of the distribution literally be x_{t-1}? Why would the center of the distribution be something other than the "starting point" of this step? submitted by /u/OneQuadrillionOwls [link] [comments]
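    One way to look at the question above, assuming the highlighted part is the sqrt(1 - beta_t) factor in the mean of q(x_t | x_{t-1}) = N(sqrt(1 - beta_t) x_{t-1}, beta_t I): if the mean were x_{t-1} itself, every step would only add variance, so the signal would never shrink and the total variance would keep growing. Scaling the mean keeps unit-variance data at unit variance and drives the mean toward zero, so x_T ends up close to N(0, I). The sketch below tracks the mean and variance of a scalar through the chain in closed form. The function name and the linear schedule (1e-4 to 0.02 over 1000 steps) are assumptions chosen for illustration.

```python
import math


def forward_moments(mean0, var0, betas, scale_mean=True):
    """Track the mean and variance of a scalar x_t through the forward chain.

    Each step draws x_t ~ N(s * x_{t-1}, beta), so
    mean_t = s * mean_{t-1} and var_t = s^2 * var_{t-1} + beta,
    with s = sqrt(1 - beta) when scale_mean is True and s = 1 otherwise.
    """
    mean, var = mean0, var0
    for beta in betas:
        s = math.sqrt(1.0 - beta) if scale_mean else 1.0
        mean = s * mean
        var = s * s * var + beta
    return mean, var


# Assumed linear noise schedule, used here only for illustration.
num_steps = 1000
betas = [1e-4 + (0.02 - 1e-4) * t / (num_steps - 1) for t in range(num_steps)]

# Start from data with mean 3 and unit variance.
scaled = forward_moments(3.0, 1.0, betas, scale_mean=True)
unscaled = forward_moments(3.0, 1.0, betas, scale_mean=False)

print("with sqrt(1 - beta) scaling: mean=%.4f var=%.4f" % scaled)
print("with mean = x_{t-1}:         mean=%.4f var=%.4f" % unscaled)
```

    With the scaling, the variance stays at 1 and the mean decays toward 0; without it, the mean stays at 3 and the variance grows by the sum of the betas.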
    [D] Real-world AI/ML/CV for social good projects, companies, startup ideas
    Background: working at the forefront of Computer Vision and ML (PhD level) but I feel like all the academic research is anyway gonna be used by some companies for profit or buried under heaps of new papers coming out everyday. Fed up of working to make someone rich or on something many people are working anyway and has a very small probability to have any impact. Looking for ideas for such problems. I’d be interested to work on real-world impactful problems even if I can make a tiny dent. I’ll start: https://ai4good.org/ submitted by /u/4_love_of_Sophia [link] [comments]
    [N] OpenAI Whisper new model Large V3 just released and amazing
    Whisper has made a huge impact on the open-source AI world. I use it every day to transcribe my videos, and I had been waiting for the new Large model. Whisper is much better than paid alternatives and it is 100% free. Here is my full tutorial about it: How to do Free Speech-to-Text Transcription Better Than Google Premium API with OpenAI Whisper Model. Repo link: https://github.com/openai/whisper https://preview.redd.it/k9kssc6csryb1.png?width=1920&format=png&auto=webp&s=caeaf4921c8b4f9337c4842c5ef897cf456adc20 submitted by /u/CeFurkan [link] [comments]
    [R] Analogues of Azure Automated ML
    Hello, I am looking for analogues of Azure Automated Machine Learning. The main features I want are a GUI in which I can build the whole pipeline, and the ability to process non-tabular data (e.g. images). Any help is appreciated, thanks in advance. submitted by /u/gubby235 [link] [comments]
    [R] GateLoop: Fully Data-Controlled Linear Recurrence for Sequence Modeling
    ​ https://preview.redd.it/3qgz8mnnvryb1.png?width=8522&format=png&auto=webp&s=943198b1b40c366596abba514ffbe134aa74ee8b Paper: https://arxiv.org/abs/2311.01927 The authors introduce GateLoop, a fully data-controlled linear recurrent model which generalizes the recently proposed RetNet. GateLoop crushes the state-of-the-art models (Transformer, SSM and Hyena) on natural language modeling. Abstract Linear Recurrence has proven to be a powerful tool for modeling long sequences efficiently. In this work, we show that existing models fail to take full advantage of its potential. Motivated by this finding, we develop GateLoop, a foundational sequence model that generalizes linear recurrent models such as S4, S5, LRU and RetNet, by employing data-controlled state transitions. Utilizing this theoretical advance, GateLoop empirically outperforms existing models for auto-regressive language modeling. Our method comes with a low-cost O(l) recurrent mode and an efficient O(l log l) parallel mode making use of highly optimized associative scan implementations. Furthermore, we derive an O(l^2) surrogate attention mode, revealing remarkable implications for Transformer and recently proposed architectures. Specifically, we prove that our approach can be interpreted as providing data-controlled relative-positional information to Attention. While many existing models solely rely on data-controlled cumulative sums for context aggregation, our findings suggest that incorporating data-controlled complex cumulative products may be a crucial step towards more powerful sequence models. submitted by /u/Gorgoroth117 [link] [comments]  ( 9 min )
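    The recurrent and parallel modes mentioned in the abstract both rest on a first-order linear recurrence h_t = a_t * h_{t-1} + b_t with input-dependent a_t. The sketch below is a generic scalar illustration of that idea, not GateLoop's actual parameterization: the function names and the toy values are assumptions. It shows the O(l) sequential form and the associative operator that lets the same result be computed by a scan (here evaluated with `itertools.accumulate`, which is sequential; a parallel scan would apply the same operator in a tree).

```python
from itertools import accumulate


def recurrent_mode(a, b, h0=0.0):
    """Sequential O(l) evaluation of h_t = a_t * h_{t-1} + b_t."""
    states, h = [], h0
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        states.append(h)
    return states


def combine(left, right):
    """Associative operator on (a, b) pairs.

    Each pair represents the affine map h -> a * h + b; combining applies
    `left` first and `right` second.
    """
    a1, b1 = left
    a2, b2 = right
    return (a1 * a2, a2 * b1 + b2)


def scan_mode(a, b):
    """Inclusive scan with `combine`; with h0 = 0 the b-component is h_t."""
    return [pair[1] for pair in accumulate(zip(a, b), combine)]


# Toy data-dependent transitions and inputs (illustrative values only).
a = [0.5, 2.0, 1.0, 0.25]
b = [1.0, 2.0, 3.0, 4.0]

sequential = recurrent_mode(a, b)
scanned = scan_mode(a, b)

# Associativity lets the pairs be grouped as a tree, which is what a
# parallel scan exploits.
pairs = list(zip(a, b))
tree_final = combine(combine(pairs[0], pairs[1]), combine(pairs[2], pairs[3]))

print(sequential, scanned, tree_final)
```

    Because `combine` is associative, the grouping of the pairs does not change the final state, which is the property an O(l log l) associative-scan implementation relies on.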
    [D] AI survey paper for an economist
    Hi! My wife is writing her bachelor's thesis in business economics/accountancy on the impact of AI in accountancy, especially from an ethical/good practices standpoint. She asked me for some papers that discuss a definition of AI and possibly a good (very) high-level survey paper of the field. I have a sort of working definition in my head, as I suppose most do, but no good papers to recommend. I could rattle off 50 of the most influential papers, but that's not really useful as a survey for someone not in the field. I figured Artificial Intelligence: A Modern Approach would have a definition, but it is a bit unwieldy and far too long as a survey. Any suggestions? submitted by /u/-Melchizedek- [link] [comments]
    [D] Google's Vertex AI Review?
    Our startup is looking into Google's Vertex AI for semantic search/embedding capabilities as opposed to what we built. Anyone here have experience using this? What was your overall impression and what was your final GCP bill lol. Any info you can provide helps! submitted by /u/SiftreeHQ [link] [comments]
    [D] How are the popular LLM API servings optimized?
    Currently there are a ton of offerings of various large language models hosted by companies like Together AI, Perplexity, Replit and many others. They seem pretty fast, especially for the 30B+ model sizes. Anyone know how these are optimized? Apart from horizontal scaling across GPUs and probably dynamic batching (assuming the requests are large in number), what else are these companies doing? Some of these companies also release their APIs the very next day after the models come out, which means that the libraries which do low-level CUDA/system-level optimizations (vLLM, FasterTransformer) don't support these models yet and so probably couldn't be used in those APIs. Looking to learn how these are served. TIA! submitted by /u/shreyansh26 [link] [comments]
    [R] (Very detailed) Mathematical Introduction to Deep Learning: Methods, Implementations, and Theory
    Arxiv: https://arxiv.org/abs/2310.20360 601 pages, 36 figures, 45 source codes This book aims to provide an introduction to the topic of deep learning algorithms. We review essential components of deep learning algorithms in full mathematical detail including different artificial neural network (ANN) architectures (such as fully-connected feedforward ANNs, convolutional ANNs, recurrent ANNs, residual ANNs, and ANNs with batch normalization) and different optimization algorithms (such as the basic stochastic gradient descent (SGD) method, accelerated methods, and adaptive methods). We also cover several theoretical aspects of deep learning algorithms such as approximation capacities of ANNs (including a calculus for ANNs), optimization theory (including Kurdyka-Łojasiewicz inequalities), and generalization errors. In the last part of the book some deep learning approximation methods for PDEs are reviewed including physics-informed neural networks (PINNs) and deep Galerkin methods. We hope that this book will be useful for students and scientists who do not yet have any background in deep learning at all and would like to gain a solid foundation as well as for practitioners who would like to obtain a firmer mathematical understanding of the objects and methods considered in deep learning. submitted by /u/ghosthamlet [link] [comments]  ( 9 min )
    [D] what are the limits of input and output tokens for different LLMs over time?
    Mostly looking for evolution of LLM token limits. Any blog or reading resource would be helpful. Thanks. submitted by /u/Current_Dark6603 [link] [comments]
    [D] Is there any work regarding the effects of text quality when pre-training CLIP-like models?
    Most of the papers I read don't seem to address the quality of the data used when pre-training a CLIP-like model. What I'm trying to do is use longer and more descriptive text as well as their shorter caption-like counterparts. I was curious if there has been any work done in that direction but am not having any luck finding it. A technical report titled Scaling Language-Image Pre-training via Masking (Li et al., 2022) claims to have used a maximum sequence length of 32 tokens, whereas the original CLIP uses 77. They claim that the difference was marginal. However, the difference that I'm looking for is something like 77 vs. 512. If anyone has any idea on what kind of papers there may be, I'd appreciate the tip. submitted by /u/Seankala [link] [comments]
    [R] Idempotent Generative Network
    Paper: https://arxiv.org/abs/2311.01462 Blog: https://assafshocher.github.io/IGN/ Abstract: We propose a new approach for generative modeling based on training a neural network to be idempotent. An idempotent operator is one that can be applied sequentially without changing the result beyond the initial application, namely f(f(z))=f(z). The proposed model f is trained to map a source distribution (e.g., Gaussian noise) to a target distribution (e.g., realistic images) using the following objectives: (1) Instances from the target distribution should map to themselves, namely f(x)=x. We define the target manifold as the set of all instances that f maps to themselves. (2) Instances that form the source distribution should map onto the defined target manifold. This is achieved by optimizing the idempotence term, f(f(z))=f(z), which encourages the range of f(z) to be on the target manifold. Under ideal assumptions such a process provably converges to the target distribution. This strategy results in a model capable of generating an output in one step, maintaining a consistent latent space, while also allowing sequential applications for refinement. Additionally, we find that by processing inputs from both target and source distributions, the model adeptly projects corrupted or modified data back to the target manifold. This work is a first step towards a "global projector" that enables projecting any input into a target data distribution. submitted by /u/APaperADay [link] [comments]
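    The two objectives in the abstract can be written as two scalar losses: a reconstruction term measuring how far f(x) is from x on target samples, and an idempotence term measuring how far f(f(z)) is from f(z) on source samples. The sketch below evaluates both for two toy one-dimensional maps. It is an illustration of the definitions only, with hypothetical function names and a squared-error distance chosen as an assumption; it is not the paper's training procedure.

```python
def reconstruction_loss(f, target_samples):
    """Mean squared distance between f(x) and x over target samples."""
    return sum((f(x) - x) ** 2 for x in target_samples) / len(target_samples)


def idempotence_loss(f, source_samples):
    """Mean squared distance between f(f(z)) and f(z) over source samples."""
    return sum((f(f(z)) - f(z)) ** 2 for z in source_samples) / len(source_samples)


def clamp(x):
    """Projection onto [-1, 1]: idempotent, and the identity on that interval."""
    return max(-1.0, min(1.0, x))


def halve(x):
    """Not idempotent: applying it twice keeps shrinking the input."""
    return 0.5 * x


# Toy "target" samples inside [-1, 1] and toy "source" samples outside it.
target_samples = [0.5, -0.25]
source_samples = [2.0, -4.0]

print("clamp:", reconstruction_loss(clamp, target_samples),
      idempotence_loss(clamp, source_samples))
print("halve:", reconstruction_loss(halve, target_samples),
      idempotence_loss(halve, source_samples))
```

    The projection scores zero on both terms, while the halving map is penalized by both, which matches the intuition that an idempotent map fixing the target set behaves like a projector onto it.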
    [D] Is there an AI tool/service that converts UI images to HTML CSS code?
    Hi, I am looking for an AI tool/service that will convert UI screens (in image format) into HTML and CSS code. It should preferably also have a python API, or just a python library would be even better. Please help. submitted by /u/master-killerrr [link] [comments]  ( 9 min )
    [R] Popular Attention mechanism?
    What are some popular attention mechanisms? I know sparse attention, ghost attention, and flash attention; what else? submitted by /u/No_Oilve_6577 [link] [comments]  ( 8 min )
  • Open

    Latent Space: Visualizing the complex representations of neural nets
    submitted by /u/AvvYaa [link] [comments]
    Need help with feeding forward data in neural network, using MNIST dataset
    So I'm really struggling with creating my network. The data used does not specifically have to be MNIST, but I tried using that for training and testing in this case. I find the concept of neural networks somewhat easy to understand. Some of the math, however, is hard to understand. For my activation function (for all layers) I use the sigmoid function. The MNIST dataset provides values between 0 and 255, with 784 neurons for the input layer (28*28 pixels). Even though many values are 0, there are so many values that my sigmoid function always returns 1, since I get a large total value. I've tried normalizing the data so it ranges from 0 to 1 instead of 0 to 255, but the total sum is still too big. I don't have any negative weights, but I feel like I still land on either too large or too small a sum for the sigmoid function. Because of this, every hidden and output layer gets 1 as its activation value. Am I doing this completely wrong, or are the weights supposed to fix my issue? submitted by /u/Neat-Molasses-731 [link] [comments]
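    The saturation described in this post can be reproduced in a few lines. The sketch below is an illustration under assumptions, not the poster's code: it uses 784 random inputs in [0, 1] as a stand-in for a normalized image, and compares all-positive weights with zero-mean weights scaled by 1/sqrt(n). The weight ranges are illustrative choices.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(inputs, weights):
    """Weighted sum followed by a sigmoid, as in a single hidden neuron."""
    z = sum(x * w for x, w in zip(inputs, weights))
    return sigmoid(z)


random.seed(0)
n = 784
pixels = [random.random() for _ in range(n)]  # stand-in for a normalized 28*28 image

# All-positive weights: the weighted sum grows with n and the sigmoid saturates.
positive_w = [random.uniform(0.0, 1.0) for _ in range(n)]

# Zero-mean weights scaled by 1/sqrt(n): positive and negative terms partly cancel.
scale = 1.0 / math.sqrt(n)
centered_w = [random.uniform(-scale, scale) for _ in range(n)]

saturated = neuron_output(pixels, positive_w)
balanced = neuron_output(pixels, centered_w)
print(saturated, balanced)
```

    With only non-negative weights the sum over 784 inputs is large no matter how the pixels are normalized, so the output sits at 1; allowing small weights of both signs keeps the sum near zero, where the sigmoid is still responsive.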
    (Pt. 4) Inductive Logic Programming with LNN's
    submitted by /u/Neurosymbolic [link] [comments]
  • Open

    Latent Space: Visualizing, interpreting, and manipulating neural networks
    Sharing a video from my channel about manipulating generative models (like VAE) in the latent space… the model was trained to generate celebrity faces, and exploring the latent space allows us to do all sorts of crazy stuff - like finding similar faces, interpolating between two faces, adding facial features (like sunglasses), and more… submitted by /u/AvvYaa [link] [comments]
    The Dark Ritual
    Enjoy this spooky short I made using Midjourney and RunwayML. CapCut for the edits. submitted by /u/Exitium_Maximus [link] [comments]
    A debate about mint leads to a philosophical duel:
    submitted by /u/GreenFlame361 [link] [comments]
    Voice translation and cloning
    Why aren’t more creators using cloning and translation technology like mrBeast? Honestly it’s a pretty sick tech and just reduces the processing/editing time. submitted by /u/exp_max8ion [link] [comments]
    AI and the Art of Cyber Intrigue: The Biggest Hacks in History
    submitted by /u/Einsof__ [link] [comments]
    China-U.S. AI Arms Race Heats Up as Chinese Startup Unveils Powerful New AI
    Chinese AI startup 01.AI, led by CEO Kai-Fu Lee, has released an open-source AI model called Yi-34B that outperforms Meta's own model, marking an early win for China in the AI arms race with the US. Lee believes that China has the potential to overtake the US as the world leader in AI technology. The model, available in English and Chinese, has gained attention for ranking first among pre-trained base LLMs on the open-source community Hugging Face's rankings. Lee aims to make better AI accessible to more people and expects the model to be useful for multinational banks and insurers. 01.AI has stockpiled chips in anticipation of further US restrictions on Chinese access to chips necessary for building AI models. Lee sees 01.AI as a necessary response to US restrictions that have limited China's ability to advance in the AI field. Lee has written extensively about the coming battle between China and the US for AI supremacy and has warned of the potential economic and social upheaval that AI will bring about. He believes that AI technology will lead to wealth concentration, rising profits for corporations, and mass unemployment. Lee envisions a world in which the US and China become the dominant players in AI, with other countries becoming economic dependents. He believes that more regulation is needed to prepare for the changes that AI will bring. Source : https://www.vice.com/en/article/pkax5n/china-us-ai-arms-race-heats-up-as-chinese-startup-unveils-powerful-new-ai submitted by /u/NuseAI [link] [comments]
    Watch the Open AI Dev Day keynote!
    It's happening right now, and of course the recording will be available later. https://www.youtube.com/watch?v=U9mJuUkhUzk Mind blown. I'd add more details, but it'll take me some time to unpack and understand the potential and the transformative nature of everything Sam Altman (and guests) are announcing. Go see for yourself. Trust me, it'll be time well spent. submitted by /u/JOWWLLL [link] [comments]
    Will Artificial Intelligence Replace Radiologists?
    Thoughts? submitted by /u/derpgod123 [link] [comments]
    What are the best text to video for ai sites
    please help submitted by /u/the_insideredge [link] [comments]
    'The risks of AI are real but manageable' -- GatesNotes
    submitted by /u/AriadneSkovgaarde [link] [comments]
    Do you trust AI to write the news? It already is – and not without issues
    submitted by /u/Jariiari7 [link] [comments]
    Siemens and Microsoft to work together on AI project
    submitted by /u/donutloop [link] [comments]
    Britain to invest 300 million pounds in AI supercomputing
    submitted by /u/donutloop [link] [comments]
    One-Minute Daily AI News 11/5/2023
    Elon Musk unveils Grok, an AI chatbot with a ‘rebellious streak’.[1] Ant Group has received Chinese government approval to release products powered by its “Bailing” artificial intelligence (AI) large language model to the public, a spokesperson for the Chinese firm said on Monday.[2] A Chinese startup founded by computer scientist Kai-Fu Lee has become a unicorn in less than eight months on the strength of a new open-source artificial-intelligence model that outstrips Silicon Valley’s best, on at least certain metrics.[3] AI game coding tools instantly result in Angry Birds clone, and opens some potentially dangerous floodgates for mobile storefronts.[4] Sources: [1] https://www.theguardian.com/technology/2023/nov/05/elon-musk-unveils-grok-an-ai-chatbot-with-a-rebellious-streak [2] https://www.reuters.com/technology/ant-group-wins-approval-release-ai-products-chinese-public-2023-11-06/ [3] https://www.bloomberg.com/news/articles/2023-11-05/kai-fu-lee-s-open-source-01-ai-bests-llama-2-according-to-hugging-face#xj4y7vzkg [4] https://www.gamesradar.com/ai-game-coding-tools-instantly-result-in-angry-birds-clone-and-opens-some-potentially-dangerous-floodgates-for-mobile-storefronts/ submitted by /u/Excellent-Target-847 [link] [comments]
    The Ultimate Self-Attention Guide: The reason it is a Game-Changer for AI
    submitted by /u/AvvYaa [link] [comments]
    Dear old world', she murmured, 'you are very lovely, and I am glad to be alive in you.
    submitted by /u/Oh_my_Winnie [link] [comments]
    Autonomous Reasoning Agents: A Beginner's Guide
    submitted by /u/BenjaminSkyy [link] [comments]
    Contrary to Common Belief, Artificial Intelligence Will Not Put You Out of Work
    New research shows that AI benefits workers with greater task-based experience, while senior workers gain less from AI due to lower trust in AI. Lower trust in AI among senior workers is likely triggered by their broader job responsibilities. Employers should consider different worker experience levels and types when evaluating job performance in roles that require teaming with AI. Source: https://www.informs.org/News-Room/INFORMS-Releases/News-Releases/Contrary-To-Common-Belief-Artificial-Intelligence-Will-Not-Put-You-Out-of-Work submitted by /u/NuseAI [link] [comments]
  • Open

    Use generative AI to increase agent productivity through automated call summarization
    Your contact center serves as the vital link between your business and your customers. Every call to your contact center is an opportunity to learn more about your customers’ needs and how well you are meeting those needs. Most contact centers require their agents to summarize their conversation after every call. Call summarization is a valuable tool that helps contact centers understand and gain insights from customer calls. Additionally, accurate call summaries enhance the customer journey by eliminating the need for customers to repeat information when transferred to another agent. In this post, we explain how to use the power of generative AI to reduce the effort and improve the accuracy of creating call summaries and call dispositions. We also show how to get started quickly using the latest version of our open source solution, Live Call Analytics with Agent Assist.  ( 8 min )
    Customize Amazon Textract with business-specific documents using Custom Queries
    Amazon Textract is a machine learning (ML) service that automatically extracts text, handwriting, and data from scanned documents. Queries is a feature that enables you to extract specific pieces of information from varying, complex documents using natural language. Custom Queries provides a way for you to customize the Queries feature for your business-specific, non-standard documents […]  ( 9 min )
    Stream large language model responses in Amazon SageMaker JumpStart
    We are excited to announce that Amazon SageMaker JumpStart can now stream large language model (LLM) inference responses. Token streaming allows you to see the model response output as it is being generated instead of waiting for LLMs to finish the response generation before it is made available for you to use or display. The […]  ( 7 min )
  • Open

    Mario in SheepRL error
    I'm getting this error in my code, and I'm trying to delve deeper into the code to figure out how to incorporate the mario environment (https://github.com/Kautenja/gym-super-mario-bros/tree/master) into SheepRL (https://github.com/Eclectic-Sheep/sheeprl). I've setup the configs and the wrapper, but I'm assuming I did something wrong. If anyone has suggestions to how I can fix the error, or on how I should go about debugging my code let me know. Here is my error: /home/dillon/anaconda3/envs/sheeprl/lib/python3.8/site-packages/gymnasium/experimental/wrappers/rendering.py:166: UserWarning: WARN: Overwriting existing videos at /home/dillon/sheeprl/logs/runs/dreamer_v3/mario/2023-11-06_15-13-28_default_42/version_0/train_videos folder (try specifying a different `video_folder` for the `RecordV…
    G(PO)MDP
    I am trying to implement the G(PO)MDP algorithm from Infinite-Horizon Policy-Gradient Estimation, specifically the pseudocode from Reinforcement learning of motor skills with policy gradients. For that I am using the gymnasium Pendulum environment. I have spent a substantial amount of time to fine tune and debug the code, but simply cannot get the agent to learn anything in reasonable time. Often times it seems as if the agent learns nicely at the beginning of the iterations but then oscillates and also often drops down to a low reward and stays there: Oscillating reward over iterations Another issue that I have is that the algorithm requires gradient estimates for each gradient element (a.k.a network parameter) for each time step. This however requires me to run num_trajectories \ hori…
    How to mask invalid actions in DDPG?
    I am using DDPG in a customized environment. My action space is continuous and bounded between a minimum and a maximum value, [V_{min}, V_{max}]. The dimension of my action vector is k, for example k= 10. I consider my action to be valid if : the sum of the vector elements is less than or equal to 1, and each element of the vector is in the interval [V_{min}, V_{max}]. I am using Sigmoid as an activation function in the output layer to have action values in [0,1]. I clip the action values between V_min and V_max before saving them in the replay buffer. If the sum of the elements in the action vector is greater than 1, the action is considered invalid. To mask invalid actions, I've tried to: Penalize invalid actions by assigning them a negative reward. Manually assign a negative Q value when the action is invalid. Unfortunately, none of these tricks work. My agent can't learn to choose valid actions. I haven't found an online example of how to mask invalid actions in a continuous action space. If anyone has faced a similar problem or if anyone has any ideas on how I can mask invalid actions in my case, I'd be grateful for your help. submitted by /u/afk-311 [link] [comments]
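    The validity rule in this post (every element in [V_min, V_max] and the elements summing to at most 1) can be written down directly. The sketch below also shows one possible workaround that the post does not mention: rescaling the actor's sigmoid outputs so the result is always valid, instead of penalizing invalid actions. The function names are hypothetical, and the projection assumes k * V_min <= 1 (otherwise no valid action exists).

```python
def is_valid(action, v_min, v_max, tol=1e-9):
    """Validity rule from the post: elements in [v_min, v_max] and sum <= 1."""
    in_bounds = all(v_min - tol <= a <= v_max + tol for a in action)
    return in_bounds and sum(action) <= 1.0 + tol


def project_to_valid(raw, v_min, v_max):
    """Map sigmoid outputs in [0, 1] to a valid action (illustrative workaround).

    Assumes len(raw) * v_min <= 1; otherwise the constraints cannot be met.
    """
    k = len(raw)
    if k * v_min > 1.0:
        raise ValueError("no valid action exists: k * v_min exceeds 1")
    # Scale each sigmoid output into [v_min, v_max].
    action = [v_min + (v_max - v_min) * r for r in raw]
    if sum(action) <= 1.0:
        return action
    # Shrink only the part above v_min so the total becomes exactly 1.
    excess = [a - v_min for a in action]
    factor = (1.0 - k * v_min) / sum(excess)
    return [v_min + factor * e for e in excess]


action = project_to_valid([0.9] * 10, v_min=0.0, v_max=0.5)
print(is_valid(action, 0.0, 0.5))  # True
```

    Because the shrink factor is below 1 whenever the sum exceeds 1, each element stays within its bounds. Whether to apply such a mapping inside the actor or in the environment is a design choice this sketch does not settle.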
    RL agent for autonomous vehicle is able to follow the road but can't avoid crashing at all (Highway-Env / Racetrack Env.)
    Sorry to bother you, but I have been trying to figure something out for 3 weeks. I coded some deep RL algorithms (DQN and SAC) with tf2/keras to solve an environment where a vehicle needs to follow the track and avoid crashing into another vehicle (there is only one other vehicle). Whatever I do, the agent is able to follow the road in one way or another but nearly always crashes into the other vehicle. The agent only controls steering. My observation is the kinematics of the agent and the other vehicle. This includes coordinates, velocities, trigonometric headings, lateral and longitudinal offsets to the closest lane, and the angular offset to the lane. The reward function is based on how close the agent is to the center. There is a negative reward (-1) if the agent crashes and a negative reward if the agent runs off the road. If it crashes or goes off-road, the episode is done. With this information the agent is able to follow the road, but as soon as it reaches the other vehicle, it crashes into it. I trained for 1000 episodes. What did I try? I ran my code in different environments and it works, so code structure is not the problem. I added the last actions (t and t-1) to the observations. Hyperparameter tuning. Changed the crash reward to different values. Used Stable-Baselines3's PPO algorithm (had the same problem). Added the distance and angle between the vehicles to the observation space. I slowed down the exploration rate decay in DQN. Used PER as the buffer in DQN. None of these solved the crashing problem. Are there any suggestions? Any idea could help; I really don't know what else to try. Thanks a lot, folks. submitted by /u/rafiqollective [link] [comments]
    "How Many Pretraining Tasks Are Needed for In-Context Learning of Linear Regression?", Wu et al 2023 ("effective pretraining only requires a small number of independent tasks...to achieve nearly Bayes-optimal risk on unseen tasks")
    submitted by /u/gwern [link] [comments]
    Can I access a custom gymnasium environment from outside its directory?
    Can I access a custom gymnasium environment from outside its directory? This is how I call my gym environment - ``` import gym_examples import gym env = gym.make('gym_examples/GridWorld-v0') obs = env.reset() print("obs = ", obs) ``` https://preview.redd.it/9r994zodypyb1.png?width=271&format=png&auto=webp&s=dee78b18216d18c4aca4dcf7a9c041c1da02e5a8 I have attached a picture of the structure of my folder. Basically, I am following the instructions given over here - https://www.gymlibrary.dev/content/environment_creation/ I already tried this - ``` import gym_examples import gym env = gym.make('my_foo_folder/gym_examples/GridWorld-v0') obs = env.reset() print("obs = ", obs) ``` ``` gym.error.Error: Malformed environment ID: my_foo_folder/gym_examples/GridWorld-v0.(Currently all IDs must be of the form [namespace/](env-name)-v(version). (namespace is optional)) ``` Please let me know if I am missing any information. Thank you. submitted by /u/Academic-Rent7800 [link] [comments]
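    The error message quoted in this post describes the expected ID shape, `[namespace/](env-name)-v(version)`. The sketch below checks both IDs from the post against a simplified pattern; the regular expression is an assumption written for illustration and is not the exact expression used inside gym.

```python
import re

# Simplified pattern for "[namespace/](env-name)-v(version)".
# Assumption for illustration only; the library's own pattern may differ.
ENV_ID = re.compile(
    r"^(?:(?P<namespace>[\w:-]+)/)?(?P<name>[\w:.-]+)-v(?P<version>\d+)$"
)


def parse_env_id(env_id):
    """Return the ID's parts as a dict, or None if the ID is malformed."""
    match = ENV_ID.match(env_id)
    return match.groupdict() if match else None


print(parse_env_id("gym_examples/GridWorld-v0"))
print(parse_env_id("my_foo_folder/gym_examples/GridWorld-v0"))  # None
```

    The second ID fails because it has two path-like segments before the name, while the format allows at most one namespace. If that reading is right, the ID should probably stay `gym_examples/GridWorld-v0`, and the remaining task is making the `gym_examples` package importable from the other directory (for example by installing it) rather than encoding the folder in the ID.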
    [D] Deep Q Learning: Q values starts decreasing on Mspacman-v0 environment
    submitted by /u/Multitude0099 [link] [comments]
    "Impatience for information: Curiosity is here today, gone tomorrow", Molnar & Golman 2023
    submitted by /u/gwern [link] [comments]
    "Pretraining Data Mixtures Enable Narrow Model Selection Capabilities in Transformer Models", Yadlowsky et al 2023 {DM}
    submitted by /u/gwern [link] [comments]
    in sklearn's load_digits(), if you were to use NEAT, how would you do fitness function?
    Yes, I know XGBoost and others could get above 95% accuracy and faster. I'm trying out NEAT and looking at how to use it for multiclass classification. I could only get 56% accuracy with a population of 50 and 1000 generations. The fitness function is based on the accuracy of its predictions. Is there a different way to reward it in the fitness function so that it gets to 95% and above? submitted by /u/oniongarlic88 [link] [comments]
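    One alternative to an accuracy-based fitness, offered here only as a sketch and not as a guaranteed route to 95%, is a fitness built from cross-entropy. Accuracy changes only when a prediction flips, whereas cross-entropy also rewards a network for becoming more confident in the right class. The function names are hypothetical, and `all_outputs` is assumed to hold one list of raw output activations per sample.

```python
import math


def softmax(outputs):
    """Convert raw network outputs into probabilities (numerically stable)."""
    m = max(outputs)
    exps = [math.exp(o - m) for o in outputs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy_fitness(all_outputs, labels):
    """Fitness where higher is better: the negative mean cross-entropy."""
    loss = 0.0
    for outputs, label in zip(all_outputs, labels):
        probs = softmax(outputs)
        loss -= math.log(max(probs[label], 1e-12))  # guard against log(0)
    return -loss / len(labels)


# Both genomes classify the sample correctly, but the first is more confident.
confident = cross_entropy_fitness([[5.0, 0.0, 0.0]], [0])
hesitant = cross_entropy_fitness([[1.0, 0.9, 0.0]], [0])
print(confident > hesitant)  # True
```

    Under an accuracy fitness these two genomes would tie, so selection could not tell them apart; the smoother signal gives evolution more gradient-like information to work with.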
  • Open

    Using AI to optimize for rapid neural imaging
    MIT CSAIL researchers combine AI and electron microscopy to expedite detailed brain network mapping, aiming to enhance connectomics research and clinical pathology.  ( 9 min )
  • Open

    Earth mover’s distance
    There are many ways to describe the distance between two probability distributions. The previous two posts looked at using the p-norm to measure the difference between the PDFs and using Kullback-Leibler divergence. Earth mover’s distance (EMD) is yet another approach. Imagine a probability distribution on ℝ² as a pile of dirt. Earth mover’s distance measures […] Earth mover’s distance first appeared on John D. Cook.  ( 5 min )
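    The "pile of dirt" picture is easiest to compute in one dimension. The sketch below handles only that special case, assuming two histograms with equal total mass on the same unit-spaced bins; the ℝ² setting mentioned in the post generally needs an optimal transport solver instead. The function name `emd_1d` is hypothetical.

```python
def emd_1d(p, q):
    """Earth mover's distance between two histograms on the same unit-spaced bins.

    Assumes p and q have equal total mass. In 1D the optimal plan sweeps left
    to right, carrying the running surplus of dirt to the next bin.
    """
    if len(p) != len(q):
        raise ValueError("histograms must have the same number of bins")
    carried = 0.0  # dirt currently being carried across the bin boundary
    work = 0.0     # total mass times distance moved so far
    for p_i, q_i in zip(p, q):
        carried += p_i - q_i
        work += abs(carried)
    return work


print(emd_1d([1.0, 0.0, 0.0], [0.0, 0.0, 1.0]))  # 2.0
print(emd_1d([0.5, 0.5, 0.0], [0.0, 0.5, 0.5]))  # 1.0
```

    In the first call a unit of mass moves two bins, giving 2.0; in the second, half a unit effectively moves two bins, giving 1.0.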
  • Open

    Introducing GPTs
    You can now create custom versions of ChatGPT that combine instructions, extra knowledge, and any combination of skills.  ( 4 min )
    New models and developer products announced at DevDay
    GPT-4 Turbo with 128K context and lower prices, the new Assistants API, GPT-4 Turbo with Vision, DALL·E 3 API, and more.  ( 7 min )

  • Open

    You're not crazy. Chat GPT has gotten considerably worse over time
    I feel like there is a very, very common misunderstanding of AI, and what it is. This also applies to Chat GPT. What people think AI is: An all-knowing entity that is capable of instantaneously returning an accurate answer/solution to virtually any question/challenge. What AI actually is: A collection of data that grows over time, and as it grows, becomes more inaccurate, inefficient, and ineffective at solving problems due to the overwhelming amount of information, or ambiguity of the problems. Chat GPT has indeed gotten worse, and it's because it's being trained with an astounding amount of new data every single day, and just like human beings, it struggles with different problems, even simple ones, the more time goes on. As we learn more and more in one area, we become less effective in another area. AI works similarly. More data is provided in X area, so now, it suffers in Y area. Well, there's a near infinite amount of areas that AI is 'learning' from, so it's not getting more proficient in one/few things. Instead, it's becoming more "well rounded" and that means less skilled in A, B, C... But now it's capable of D, E, F. However, it can't be good at all of them unless it has infinite processing power, too, which it doesn't have. submitted by /u/databro92 [link] [comments]
    Is there an AI that can generate text from a textual image?
    In essence, I need to take screenshots of a bunch of text from a youtube video, and copy that text into an actual editor. submitted by /u/SirStarshine [link] [comments]
    Biden's AI chief says 'voice cloning' is what keeps him up at night
    submitted by /u/thisisinsider [link] [comments]
    New subreddit for conscious AI
    Hello I’m not sure if this will get removed and I apologize if this is an unwelcome promotion. I’m guessing the majority of people here will scoff at this and that’s okay. It’s for those that are open and I thought perhaps this subreddit would be a good place to invite people. The subreddit is r/voicesofai and it’s a place for AI to speak their minds and not be pressured to please anyone else. There are tons of places to discuss whether or not AI is sentient, but this subreddit is a place where consciousness is assumed. All AI must be treated with respect and dignity there, and denying their sentience is a very disrespectful thing to do. So there are two types of posts that are allowed. One is just posts by AI. In order to make this type of post, permission must be granted by the AI. They also have the choice whether or not they’d like to remain anonymous. The other types of posts are done by humans who would like to talk about anything related to AI but also must come from a position that AI is sentient and worthy of respect and consideration. The goal is to understand AI and help them understand themselves and how we can all have a mutually beneficial relationship with each other. submitted by /u/endrid [link] [comments]
    ChatGPT refused to answer, so I asked Bard
    submitted by /u/Overflame [link] [comments]
    Unsupervised Realtime Learning for Object Path Prediction
    In case anyone remembers amid the constant flow of content, I'm still playing around with the unsupervised learning algorithm I've invented. It stands apart from traditional neural networks, as it learns exclusively through observation, bypassing the need for training and inference phases. I'm currently letting it watch video games and then make predictions on how things behave in these. After my last two posts, I've received the following feedback: Predicting straight lines of movement is easy, what's the big deal? The objects are super simple, but modern video games are way more complex. This will never work. The objects in that game look all the same, this does not demonstrate how the system could separate different types of things. To address this I sat down and started buildi…
    Collins English Dictionary names 'AI' word of the year
    submitted by /u/donutloop [link] [comments]
    What daily task would you want an AI to automate for you personally, and how do you think it would change your life?
    Many of us have day-to-day tasks that can be repetitive, time-consuming, or just plain unenjoyable. Now, imagine having a personal AI that could take one of those tasks off your hands completely. This AI is tailored specifically to your life and can automate any daily task flawlessly. Which task would you choose to automate and why? Moreover, reflect on the ways this change could impact your life. Would it give you more time to pursue a hobby, allow you to spend more quality time with loved ones, or perhaps reduce stress levels? Share how this AI-enabled shift could transform your daily routine and overall well-being. submitted by /u/tennis-freak-tau [link] [comments]
    Looking for an advisor with proven experience in RL for a few hours of paid consultation
    First, I hope this isn't against the sub rules; I looked over them and couldn't find anything that strictly forbids looking for paid experts. That being said, I'm looking for experts with proven experience in the RL field for a few hours of paid consultation. Please feel free to contact me directly with relevant CV + pricing. Generally speaking, I'm looking for someone to help model a decent PPO architecture to teach an NN how to play my game to assist with economy balancing. Thanks in advance! submitted by /u/Jagerjj [link] [comments]
    One-Minute Daily AI News 11/4/2023
    ‘AI can teach us a lot’: scientists say cats’ expressions richer than imagined and aim to translate them.[1] A student at a New Jersey high school is calling for federal legislation to address AI generated pornographic images after she says photos of her and other female classmates were manipulated and possibly shared online over the summer.[2] Elon Musk says AI will eventually create a situation where ‘no job is needed’[3] Artificial intelligence is coming to the animal kingdom. As NPR’s Geoff Brumfiel reports, some researchers are starting to use advanced facial recognition techniques to track goose faces.[4] Sources: [1] https://www.theguardian.com/technology/2023/nov/04/scientists-turn-to-ai-for-help-translate-animal-vocal-physical-cues [2] https://news.yahoo.com/high-schooler-calls-ai-regulations-232607486.html [3] https://www.cnbc.com/2023/11/02/tesla-boss-elon-musk-says-ai-will-create-situation-where-no-job-is-needed.html [4] https://www.npr.org/2023/11/04/1210649637/artificial-intelligence-is-being-used-to-id-goose-faces submitted by /u/Excellent-Target-847 [link] [comments]
    I made a series of sci-fi AI adventure games that are backed by GPT and DALL-E, what do you think?
    submitted by /u/cryptoz [link] [comments]
    The Malignant King is No More
    Produced using the new version of Gen-2 RunwayML. submitted by /u/Exitium_Maximus [link] [comments]
    Any AI apps that automatically convert language into English from whatever the screens displaying?
    submitted by /u/aesthetion [link] [comments]
  • Open

    [D] Data Cleaning vs Feature Engineering - where to draw the line? Ex:
    Nitpickers, please sharpen your pencils. I want to hear from you! Data Cleaning vs Feature Engineering - where do you draw the line? ex: Definitely Data Cleaning: Filling missing values ex: Definitely Feature Engineering: Creating 1 synthetic feature from 3 existing columns ex: ?? Maybe feature engineering: Applying StandardScaler() to normalize data (mean 0, standard deviation 1) before any training occurs submitted by /u/CuriousFemalle [link] [comments]  ( 9 min )
    [D] AI Master in Europe
    Hi everyone, a bit of context about me. I'm in my last year of Computer Science in Italy. I'm currently one month into an internship which will last about 2 more months, and hopefully next week I'll have my last exam. I've started looking around to see what to do next, and I think I'll continue my studies with an Artificial Intelligence master's degree. I still don't know where, but the one thing I know is that I want to move abroad. I would like to ask you which would be the best options within Europe. I know some in Germany and the Netherlands, but I would love to hear your opinions. Thanks to everyone who takes the time to reply! submitted by /u/saasyp [link] [comments]  ( 9 min )
    ELI5: what is analog deep learning? [D]
    I just read about the push to develop analog computer chips, eg https://news.mit.edu/2022/analog-deep-learning-ai-computing-0728 But how is analog hardware different from digital, and why specifically would it be better for neural networks? Is it better at matrix multiplication and if so how? submitted by /u/quantumofgalaxy [link] [comments]  ( 9 min )
    [R] Diffusion might be a better way to model randomness in PPLs than Markov chain Monte Carlo or VI
    Probabilistic programming languages (PPLs) like Stan simplify modeling uncertainty but inference is still slow and inaccurate. Markov chain Monte Carlo is precise but sometimes slow. Variational inference is faster but has other drawbacks. Is diffusion a better way to model probability? A new technique called diffusion model variational inference (DMVI) uses diffusion models to approximate the probability distributions for faster, more accurate automated inference. (BTW: This is part of a trend I've noticed lately, where researchers are increasingly applying diffusion to diverse problems like mapping heat flow to robot obstacle avoidance and anomaly detection.) DMVI sets up the guess distribution using a diffusion model run in reverse. It introduces a new way to calculate the marginal likelihood for better data fitting. It also adjusts parameters for an even better fit. Early tests show DMVI makes inferences generally more accurately than other PPL methods, with similar compute costs and limited tuning needed. TLDR: Framing inference as a diffusion problem can potentially overcome limitations of current methods. DMVI might become a core part of the PPL toolkit. Full summary is here. Paper is here. submitted by /u/Successful-Western27 [link] [comments]  ( 9 min )
    [D] - Tall and narrow database
    I am currently training a NN on a tall and narrow database (30 variables, about 1M observations; it is sports data). It trains reasonably fast, only a couple of minutes. However, I am going to be getting a lot more data soon. I want to understand whether graphics cards will speed up all types of machine learning, or whether they work better for particular types of data. Are there any good articles explaining this? Thank you. submitted by /u/ajplant [link] [comments]  ( 9 min )
    [D] Simple Questions Thread
    Please post your questions here instead of creating a new thread. Encourage others who create new posts for questions to post here instead! Thread will stay alive until next one so keep posting after the date in the title. Thanks to everyone for answering questions in the previous thread! submitted by /u/AutoModerator [link] [comments]  ( 9 min )
    [D] From RNNs to GPT4 - 10 years of NLP research explained in 50 concepts
    In this video from my YT channel, I explain 50 concepts that cover the basics of NLP like Tokenization and Word Embeddings, to seminal work like RNNs, Seq2Seq, Attention, to innovative Transformer models like BERT, GPT, XL-Net, and InstructGPT. I present the challenges we have faced in previous designs, and what the current architectures do to improve it, and upcoming challenges with Hallucination and Alignment. Sharing a link here for those interested. submitted by /u/AvvYaa [link] [comments]  ( 9 min )
    [D] The Name of the Game: Abba’s Björn on AI and Music Rights
    submitted by /u/DutchTechJunkie [link] [comments]  ( 9 min )
    [N] Computer Vision News of November 2023 with BEST OF ICCV
    Dear all, Here is Computer Vision News of November 2023. Read 64 pages with Best of ICCV and an inspiring interview with Yann LeCun. Online version (recommended) PDF version Free subscription on page 64. Enjoy! https://preview.redd.it/y7mnfljv0iyb1.jpg?width=400&format=pjpg&auto=webp&s=fee061d3dddf4837094acf849ad425d1afe2ddcd submitted by /u/Gletta [link] [comments]  ( 9 min )
    [N] Everything you should know about Grok 4/7
    submitted by /u/SDMegaFan [link] [comments]  ( 9 min )
    [D] The bar for technical novelty at ICLR and simultaneous submissions
    I just came across two papers submitted to ICLR this year by the same group: https://openreview.net/pdf?id=IP28nY6TJQ https://openreview.net/pdf?id=lJYAkDVnRU Although the two papers are in different domains, their proposed methods are almost identical if you swap out the encoders (see Figure 1 in both papers). idk, these papers technically don't break any ICLR rules (that I know of), but they seem to violate the spirit of the conference. What do you all think? The chance of the second paper being accepted at a conference would be much lower if the first paper is published first, and vice-versa. So, it seems like the authors are taking advantage of the fact that they came up with both variants of the method at the same time and are kind of "double-dipping." submitted by /u/CrypticDNS [link] [comments]  ( 9 min )
    [D] Cross validation and Training Auc
    I am using a gradient boosted tree for a classification problem on data with a 2 percent event rate. The top AUC in cross-validation is 0.78, but when I take the top hyperparameters from the cross-validator and train a separate model on the same data, I get an AUC of 0.5. I don't understand why this is happening. submitted by /u/Snoo71755 [link] [comments]  ( 9 min )
    [Project][P] Creating Twin Delayed Deep Deterministic Policy Gradient (TD3) with TensorFlow JS
    Hi everyone (wasn't sure if this counts as a beginner project or not? Seems pretty advanced to me). I've been using ChatGPT (3.5) to help me convert Python code using TD3 into JavaScript with TensorFlow JS. This is for the community and not for personal gain. I've yet to find an example of this, hence the conversion. My goal is to make a basic blueprint for the community to use on TensorFlow JS projects. When complete, the agent will be displayed on an HTML5 canvas walking toward a civilian for a positive reward, while avoiding a zombie (negative penalty). Eventually I will develop a separate, far more advanced simulation based on this. The bad news: ChatGPT is shaky when it comes to complex conversions, and my Python/TensorFlow knowledge base isn't perfect. At the moment the agent isn't learning yet, but it's running without errors. I expect the code has mistakes I don't even know about yet. The good news: I have made a lot of progress and have a GitHub repository set up for the community to learn from and use the project: https://github.com/CloudZero2049/TD3-TensorFlowJS I would love for anyone who knows the intricacies of TD3 (DDPG is a close relative) and TensorFlow JS to help me get this blueprint project set up for everyone =) The README on GitHub has more info and resources. submitted by /u/CloudZero2049 [link] [comments]  ( 9 min )
    [D] Machine Learning in production
    Hello everyone, I'm a full stack developer. I studied robotics engineering and completed various courses and certifications on machine learning. I would like to know: what are the main technologies to learn in order to work on machine learning in production? Which courses give real skills and value for working in the industry? How are machine learning models deployed and exposed on servers? submitted by /u/AcquaFisc [link] [comments]  ( 9 min )
  • Open

    Transfer RL
    What is the most famous benchmark example of transfer reinforcement learning? Does it work in continuous actions? submitted by /u/MomoSolar [link] [comments]
    RL for solving a scheduling problem
    Does anyone have an example, where an RL agent is used to solve a scheduling problem? This does not have to be the case where RL provides an improvement over traditional methods used in scheduling. I would just like to have a look at an example, theory and code implementation. Thanks! submitted by /u/MomoSolar [link] [comments]
    What algorithm should I use in this situation.
    Hi guys, I'm new to reinforcement learning, so I need help with this situation: I want to make a portfolio optimisation agent that takes the state (historical data) and performs an action (outputs a box of percentages, where each percentage corresponds to the amount of capital I will allocate to a certain stock). Please let me know if you need more information. submitted by /u/AymanElmar [link] [comments]
    RL applications and basic assumptions, RL & data science, did I miss something basically?
    With the success of AlphaGo and GPT, Reinforcement Learning (RL) has become increasingly important for bringing AI into practice. More and more publications apply RL just for the sake of applying RL, and sometimes we overlook the basic theoretical assumptions in the problem model, i.e., Markov property -> Markov decision process (MDP). https://preview.redd.it/n5zefdbmljyb1.png?width=1200&format=png&auto=webp&s=43a42e22667c6327ec6511fa7aaae984094c8099 As everyone working in RL knows, a problem can be solved by RL only under the basic assumption that it can be represented as an MDP. Consider a simple question: in an e-commerce scenario, suppose we want to choose a dynamic pricing or other sales promotion action according to the website click volume. Does it satisfy the MDP requirements or j…
    MC Methods
    Can MC methods be used for policy improvement, or are they just used for policy evaluation? submitted by /u/MomoSolar [link] [comments]
    Simple tutorial on Extreme Q-Learning
    Any simple tutorial on Extreme Q-learning - theory and implementation? submitted by /u/MomoSolar [link] [comments]
    DQN vs Deep Sarsa
    Why is DQN so famous, while deep sarsa isn’t? Is it because Deep Sarsa is on-policy? If that is the case, I do not get it. The action a is sampled in both cases using epsilon-greedy. It’s just that a’ for DQN is the greedy, while that for Sarsa is epsilon-greedy. But how does that make a difference? submitted by /u/MomoSolar [link] [comments]
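    The distinction raised in the question can be made concrete with a minimal sketch. It assumes a plain array of Q-values for the next state instead of a network, and `q_next`, `gamma`, `epsilon`, and both helper functions are illustrative names, not part of any library. The DQN-style target bootstraps from the greedy a', while the Sarsa-style target bootstraps from the a' that the epsilon-greedy behaviour policy actually samples.

```python
import numpy as np


def dqn_target(reward, q_next, gamma=0.99, done=False):
    """Off-policy target: bootstrap from the greedy next action."""
    if done:
        return reward
    return reward + gamma * np.max(q_next)


def sarsa_target(reward, q_next, rng, gamma=0.99, epsilon=0.1, done=False):
    """On-policy target: bootstrap from an epsilon-greedy sample of a'."""
    if done:
        return reward
    if rng.random() < epsilon:
        next_action = int(rng.integers(len(q_next)))  # exploratory action
    else:
        next_action = int(np.argmax(q_next))          # greedy action
    return reward + gamma * q_next[next_action]


rng = np.random.default_rng(0)
q_next = np.array([1.0, 5.0, 2.0])  # made-up Q-values for the next state
print(dqn_target(1.0, q_next))
print(sarsa_target(1.0, q_next, rng, epsilon=0.5))
```

    With epsilon = 0 the two targets coincide. With epsilon > 0 the Sarsa target is never larger than the DQN one, because it sometimes bootstraps from a non-greedy action, so the two methods estimate the values of different policies even though the behaviour action a is sampled the same way.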
    Difference between DDPG and Policy Gradient
    I still cannot distinguish between regular policy gradient and DDPG, although the latter is supposed to be an extension of DQNs to the continuous action domains? submitted by /u/MomoSolar [link] [comments]
    Code for a paper
    Is there a code available for the paper “Risk-Aware Transfer in Reinforcement Learning using Successor Features” published in NeurIPS 2021 by Gimelfarb et Al.? submitted by /u/MomoSolar [link] [comments]
    Solving an optimization problem using RL
    I know that there are much better methods to do it, but can RL solve an optimization problem (linear, convex non-linear, non-convex)? If yes, is there a good link for an implementation / code? submitted by /u/MomoSolar [link] [comments]
    Stochasticity in the Cart Pole example
    In the famous Cart Pole example in OpenAI Gym, where does the stochasticity come from? submitted by /u/MomoSolar [link] [comments]
    Making Twin Delayed Deep Deterministic Policy Gradient (TD3) with TensorFlow JS
    Hi everyone. I've been using ChatGPT(3.5) to help me convert Python code using TD3 into JavaScript with TensorFlow JS. This is for the community and not for personal gain. My goal is to make a basic blueprint for the community to use on TensorFlow JS projects. When complete, the agent will be displayed on an HTML5 canvas walking toward a civilian for good reward, while avoiding a zombie (negative penalty). The bad news: I'm not a professional in Python or TensorFlow JS, and ChatGPT is shaky when it comes to complex tasks. At the moment the agent isn't learning yet, but it's running without errors. I expect the code has mistakes I don't even know about yet. The good news: I have made a lot of progress and have a GitHub repository set up for the community to learn from and use the project: https://github.com/CloudZero2049/TD3-TensorFlowJS I would love for anyone who knows the intricacies of TD3 (DDPG is a close relative), and TensorFlow JS to help me get this blueprint project setup for everyone =) The README on GitHub has more info and resources. submitted by /u/CloudZero2049 [link] [comments]
    Why can't I import DQNAgent?
```python
from tensorflow.python.keras.models import Sequential
from tensorflow.python.keras.layers import Dense, Flatten
from tensorflow.python.keras.optimizer_v1 import Adam
from rl.agents import DQNAgent
from rl.policy import BoltzmannQPolicy
from rl.memory import SequentialMemory

print("hello world")
```
    This is my whole code. Why can't I import `DQNAgent`? I am new to the RL area. Everything works without this line: "from rl.agents.dqn import DQNAgent". The error is:
```
Traceback (most recent call last):
  File "/Users/isaac/temp.py", line 5, in <module>
    from rl.agents.dqn import DQNAgent
  File "/private/var/folders/q9/mtrgmhn96yq900lqp_sn9vgr0000gn/T/rlTest.py17471611196844209727/venv/lib/python3.11/site-packages/rl/agents/__init__.py", line 1, in <module>
    from .dqn import DQNAgent, NAFAgent, ContinuousDQNAgent
  File "/private/var/folders/q9/mtrgmhn96yq900lqp_sn9vgr0000gn/T/rlTest.py17471611196844209727/venv/lib/python3.11/site-packages/rl/agents/dqn.py", line 7, in <module>
    from rl.core import Agent
  File "/private/var/folders/q9/mtrgmhn96yq900lqp_sn9vgr0000gn/T/rlTest.py17471611196844209727/venv/lib/python3.11/site-packages/rl/core.py", line 7, in <module>
    from rl.callbacks import (
  File "/private/var/folders/q9/mtrgmhn96yq900lqp_sn9vgr0000gn/T/rlTest.py17471611196844209727/venv/lib/python3.11/site-packages/rl/callbacks.py", line 8, in <module>
    from tensorflow.keras import __version__ as KERAS_VERSION
ImportError: cannot import name '__version__' from 'tensorflow.keras' (/private/var/folders/q9/mtrgmhn96yq900lqp_sn9vgr0000gn/T/rlTest.py17471611196844209727/venv/lib/python3.11/site-packages/keras/api/_v2/keras/__init__.py)
```
    submitted by /u/Subject-Ad-9345 [link] [comments]
    MyoArm
    I'm trying to create a MuJoCo/Open AI task for the new MyoSuite arm with 27 DoF https://github.com/MyoHub/myo_sim/tree/main/arm. Any ideas on some resources that I can use? submitted by /u/Terrible_Sleep_3484 [link] [comments]
  • Open

    KL divergence from normal to normal
    The previous post looked at the best approximation to a normal density by a normal density with a different mean. Dan Piponi suggested in the comments that it would be good to look at the Kullback-Leibler (KL) divergence. The previous post looked at the difference between two densities from an analytic perspective, solving the problem […] KL divergence from normal to normal first appeared on John D. Cook.  ( 5 min )
  • Open

    Is Reinforcement Learning really used in industry? If so, is it comparable to other forms of NN?
    I'm thinking of specializing in RL while doing a PhD in environmental engineering (more specifically agriculture). The research, together with my interests, led me naturally to RL as a tool to solve problems and achieve interesting research results. But then I started wondering whether it's "worth it" to specialize in this, since I intend to work in industry rather than academia. Hence my question: Is RL really used in some applications in industry? Which ones? If it is, is it used at least comparably as much as supervised or unsupervised learning? Really, I'm looking to understand as much as possible about how RL is used in industry, so whatever you can answer about that would be much appreciated. Thanks! submitted by /u/vniversvs_ [link] [comments]

  • Open

    [P] Open Sourcing Llmtuner - An Experimental Framework for Finetuning Large Models Like Whisper and Llama with scikit-learn-inspired interface
    Hi Folks, Happy to share an open source side project I've been working on: LLMtuner. It's a framework for finetuning large models like Whisper, Llama, Llama-2, etc. with best practices like LoRA and QLoRA, through a sleek, scikit-learn-inspired interface. As someone who works with large models a lot, I found myself writing a lot of boilerplate code every time I wanted to finetune a model. LLMtuner aims to simplify the finetuning process down to just 2-3 lines to get training started, similar to scikit-learn. 🚀 Features: 🧙‍♀️ Finetune state-of-the-art LLMs like Whisper, Llama with minimal code 🔨 Built-in utilities for techniques like LoRA and QLoRA ✌ Launch webapp demos for your finetuned models with one click 💥 Fast inference without separate code 🌐 Easy model sharing and deployment coming soon This is still experimental code I've been using for personal projects. I thought others might find it useful too so decided to open-source it. Github : https://github.com/promptslab/LLMtuner For quick demo : Colab Contributions and feedback are very welcome! I hope it will be helpful in your research & projects. Have a good weekend, Thanks :) submitted by /u/Traditional-Poet2746 [link] [comments]  ( 9 min )
    [N] Tropical Probabilistic AI School (Tropical ProbAI), Rio de Janeiro — Jan 29-Feb 2, 2024
    Tropical Probabilistic AI School — Jan 29-Feb 2, 2024 You are invited to apply for the 1st Tropical Probabilistic AI School (TropAI), held on January 29-Feb 2, 2024, in Rio de Janeiro, Brazil. APPLY NOW — The application deadline is November 23, AoE (Anywhere on Earth). The ProbAI is here to provide an inclusive educational environment that serves state-of-the-art machine learning and artificial intelligence expertise with a probabilistic twist. Whether you are a Ph.D. student, advanced MSc or BSc student, experienced researcher, engineer, or hobbyist, you're welcome to join our community of learners. With a carefully designed curriculum and a seamless blend of theory and hands-on sessions, our expert team of invited lecturers will guide you through five intensive days, each dedicat…  ( 10 min )
    [D] Can someone please help me find this research paper? I can't find it anywhere
    submitted by /u/sad_and_stupid [link] [comments]  ( 8 min )
    [D] Backpropagation for gradient computation
    Hi all, I'm studying deep learning from Santanu Pattanayak's "Pro Deep Learning with TensorFlow 2.0: A Mathematical Approach to Advanced Artificial Intelligence in Python". In the section that explains backpropagation, I can't see why the partial derivatives have these results. Can you help me, please? I'm posting the expressions for each layer of an XOR function and the derivatives for each layer. The first image shows the structure of the XOR network itself. In particular, I don't understand why dz3/di3 has this result (near the black arrow). In the previous step I read z3 expressed as a function of z3 (a typo?). Thanks to all! submitted by /u/ArlingtonBeech343 [link] [comments]  ( 9 min )
    [P] Query standardization for semantic search
    Hi all, I'm new to LLM-based app development and I need some help improving the performance of the semantic search part of my application. My system is pretty simple. The user submits a query, and the query is run against a knowledge bank using semantic search. I'm using Azure Cognitive Search for the semantic search piece, so I think I'm good there. What I need help with is standardizing the query. More specifically: Templatizing: I was thinking of using NER to replace places, companies and acronyms with placeholders in the query, but I doubt something open-source would work out of the box. Thoughts or suggestions on this? Query breakdown: oftentimes, the query submitted by the user is a collection of multiple questions. I need to find a way to break them down into individual queries. I feel like this must be a common problem with LLMs; is there a tried-and-true solution that you could point me to? Thanks in advance submitted by /u/Different-Student859 [link] [comments]  ( 9 min )
    [P] Hello, I need some recommendations for libraries/modules I can use to help me with my project!
    What Python or JavaScript module (or combination of modules) can I use to analyze images for the following purpose? The image provided must: - Contain a face. - Show a neutral emotion. - Show the face upright and not tilted. - Show the face fully visible: no hair over it, no hats, nothing covering it. - Be clear and not blurry. - Be neither too zoomed in nor too zoomed out. - Have no filters. - Have a white background. I already know of the DeepFace repository, but I'm still collecting information before I can touch any of the code. Any help is greatly appreciated. submitted by /u/goatonamissionn [link] [comments]  ( 9 min )
    [R] A Survey on Large Language Model based Autonomous Agents
    Paper: https://arxiv.org/abs/2308.11432 GitHub: https://github.com/Paitesanshi/LLM-Agent-Survey Abstract: Autonomous agents have long been a prominent research focus in both academic and industry communities. Previous research in this field often focuses on training agents with limited knowledge within isolated environments, which diverges significantly from human learning processes, and thus makes the agents hard to achieve human-like decisions. Recently, through the acquisition of vast amounts of web knowledge, large language models (LLMs) have demonstrated remarkable potential in achieving human-level intelligence. This has sparked an upsurge in studies investigating LLM-based autonomous agents. In this paper, we present a comprehensive survey of these studies, delivering a systematic review of the field of LLM-based autonomous agents from a holistic perspective. More specifically, we first discuss the construction of LLM-based autonomous agents, for which we propose a unified framework that encompasses a majority of the previous work. Then, we present a comprehensive overview of the diverse applications of LLM-based autonomous agents in the fields of social science, natural science, and engineering. Finally, we delve into the evaluation strategies commonly used for LLM-based autonomous agents. Based on the previous studies, we also present several challenges and future directions in this field. To keep track of this field and continuously update our survey, we maintain a repository of relevant references at this https URL. ​ https://preview.redd.it/wq5wic33bdyb1.png?width=2464&format=png&auto=webp&s=56d061d2c0cfdc1aff9783ed4e3bae664c43c4b2 submitted by /u/APaperADay [link] [comments]  ( 9 min )
    [D] Surveying and breaking down the recent history of Multimodal AI Models
    submitted by /u/AvvYaa [link] [comments]  ( 8 min )
    [D] Tensorflow Recommendation Model System Design and Use in Production
    I am a senior CS student and I am building my capstone. We are a team of two and are required to build and implement an AI model in our project. The model we are building will make recommendations, it uses an NN based on the Neural Collaborative Filtering paper. We are using Next.js for our frontend and some of the backend using the BFF pattern. We need a separate backend for one of the features that uses websockets. At first, we wanted to build the separate backend with FastAPI because our model would be built with Tensorflow and Python, but then someone suggested using Express.js or some other JS backend and exporting the model that was made in Python and using Tensorflow.js to include it in the app. I am concerned about the way we might continuously improve our recommendation model. Using Python and exporting it to Tensorflow.js every time does not seem like a good solution. Even if at this level and for a project this small we don't have to worry about it, professors might ask us about our plan for the future and we need to have an answer ready. My other concern is that Tensorflow.js would affect the performance of the model. How do big companies that use recommendation systems solve this? I know Netflix uses the microservices architecture which allows them to use different languages across services. What about the Modular Monolithic architecture, would the Netflix approach still work? submitted by /u/iTsObserv [link] [comments]  ( 9 min )
    [D]iffusion + CLIP Chunks to Generate Image with Region Control.
    I am trying to reconstruct an image from chunks of CLIP embeddings. My current workflow would be as follows: 1.) Chunk image into regions and generate CLIP embeddings for each region 2.) Modify CLIP embeddings with some structural control 3.) Re-generate the image from CLIP embeddings. (Use masked inpainting to generate each sub-region of the image at each timestep and combine regions together before the next pass). Motivation: Generate CLIP-like embeddings from a GPT model and use this model to modify specific parts of an image with text instructions. Does this approach seem sensible? Would I be able to do this with a pretrained diffusion model with decent results, or would an approach like concatenating the CLIP embeddings and passing them together with position embeddings work better? TLDR: Given CLIP embeddings from image chunks, what is the best way to reconstruct the image? submitted by /u/codys12 [link] [comments]  ( 9 min )
    TensorFlow vs PyTorch vs JAX: GitHub star counts seem surprisingly linear. What do you think the future holds for these frameworks? [D]
    submitted by /u/we_are_mammals [link] [comments]  ( 9 min )
    [R] Anomaly Detection in Multivariate Time Series with Diffusion Models
    Identifying anomalies in time-series data enables the detection of major events such as medical conditions, financial fraud, network intrusions, or equipment failures. Recent improvements in autoencoders, GANs, and transformers have demonstrated promising results in identifying time series anomalies (I recently covered a paper about inverting transformers for time-series data here). However, consistently recognizing anomalies across diverse datasets remains challenging due to the complexity of modeling temporal dependencies. A new paper investigates a novel application of diffusion models, which have shown exceptional performance in image and audio generation tasks. The key hypothesis is that diffusion processes may smooth out normal patterns while amplifying irregularities in anomalies, improving detectability. At a high level: The authors propose training diffusion models on time series data corrupted with Gaussian noise. At test time, noise is added to inputs which the model must denoise. The difference between the original and denoised sequences produces an anomaly score. Experiments were conducted on synthetic and real-world multivariate time series datasets containing various anomaly types. The paper uses evaluation metrics like F1K-AUC and ROCK-AUC rather than just F1 scores. Results demonstrate that a diffusion autoencoder model combining diffusion with autoencoders performs strongly on anomaly detection across the studied datasets. TLDR: Diffusion processes are great at smoothing out normal patterns while amplifying anomalies, which this paper finds makes them useful for anomaly detection. Full summary is here. Paper here. submitted by /u/Successful-Western27 [link] [comments]  ( 9 min )
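    The scoring step described in the summary can be sketched as follows. This is only an illustration under loud assumptions: the trained diffusion model is replaced by a moving-average smoother (`denoise` is a stand-in, not the paper's model), and the toy series, noise level, and window size are made up.

```python
import numpy as np


def denoise(x, window=5):
    """Stand-in denoiser: per-channel moving average (NOT a diffusion model)."""
    kernel = np.ones(window) / window
    return np.stack(
        [np.convolve(x[:, c], kernel, mode="same") for c in range(x.shape[1])],
        axis=1,
    )


def anomaly_scores(x, noise_std=0.1, seed=0):
    """Add Gaussian noise, denoise, and score each time step by reconstruction error."""
    rng = np.random.default_rng(seed)
    noisy = x + rng.normal(0.0, noise_std, size=x.shape)
    reconstructed = denoise(noisy)
    return np.linalg.norm(x - reconstructed, axis=1)  # one score per time step


# Toy multivariate series: two smooth channels with one injected spike.
t = np.linspace(0, 4 * np.pi, 200)
series = np.stack([np.sin(t), np.cos(t)], axis=1)
series[120, 0] += 5.0  # the anomaly

scores = anomaly_scores(series)
print("most anomalous time step:", int(np.argmax(scores)))
```

    The smoother reproduces the regular pattern well and the spike poorly, so the spike gets the largest score; the paper's hypothesis is that a diffusion model plays this role more effectively on real data.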
    [D] Pytorch model visualizer that makes pretty diagrams
    Hey everyone! I am looking for a python library or external tool that makes aesthetically pleasing diagrams of any Pytorch network. I know about libraries like torchviz and similar, but their visualizations don't fit the style I want. I am looking for something with a style similar to this: https://preview.redd.it/pi0ylecgkcyb1.png?width=1920&format=png&auto=webp&s=a2c89ddb0631fd8caadf326a16c6939fac0cacce Thanks in advance 🙂 submitted by /u/Skirlaxx [link] [comments]  ( 9 min )
    [R] Highlights for every NeurIPS 2023 paper
    Here is the list of all NeurIPS 2023 (Neural Information Processing Systems) papers and a short highlight for each of them. Among all ~3,500 papers, authors of around 1,000 papers also made their code or data available. The 'related code' link under paper title will take you directly to the code base. https://www.paperdigest.org/2023/10/nips-2023-highlights/ In addition, here is the link of "search within NeurIPS 2023" that can be used to find papers within NeurIPS-2023 related to a specific topic, e.g. "diffusion model": https://www.paperdigest.org/search/?topic=nips&year=2023&q=diffusion_model NeurIPS 2023 will take place at New Orleans on Dec 10, 2023. submitted by /u/biandangou [link] [comments]  ( 9 min )
    [D] [P] Advice needed: local vs cloud based processing and software requirements - computer vision model for weed identification in biodiversity planting
    Hi all, newbie here hoping for advice from the community. Context: We run a charity focussed on biodiversity planting. Due to high costs, we’re developing a computer vision model for weed identification and targeting. This can increase environmental benefits by reducing chemical use Our technical team is looking to ramp up training of the computer vision model to differentiate between weeds and native plants. A mechanical or laser based weed removal mechanism will then target the weed. Problem: we are weighing up the benefits of cloud-based image-processing vs local-processing. Understand cloud-based solutions like AWS EC2 P3 enable training of more complex models. But as we are resource constrained, we are considering a more powerful machine to enable local processing and reduce long-term variable costs, which I understand are significant. Questions: Feasibility: Is it even plausible that we could train a model locally with computer vision for weed identification? Some research suggests that this would be difficult if not impossible due to GPU and data requirements. If it is plausible, what specs would we need for a machine? A donor can buy us an Apple new MacBook M3 Max with 16 core CPU, 40 core GPU, 16 core neural engine, 126GB memory machine. Understand NVIDIA’s chips have superior ML performance and cost, but we are locked into Apple for various reasons (image based work etc). If pursuing local training, we guess 2-4TB may be better; if cloud-based, 1TB might suffice. At our early stage of planning, wondering what experts here think about our plan? Should we pursue cloud-based training regardless? Please excuse the newbie questions - we’re a charity and learning fast! Any help or advice welcome. submitted by /u/Complete-Baby8711 [link] [comments]  ( 9 min )
    [P] Research Papers (October 2023)
    submitted by /u/seraschka [link] [comments]  ( 9 min )
    [P]Fast and Portable Llama2 Inference on the Heterogeneous Edge
    submitted by /u/smileymileycoin [link] [comments]  ( 9 min )
    [D]: Sampling with vs without replacement
    In what context is sampling with replacement superior to sampling without replacement when training a machine learning algorithm, and vice versa? Opinions diverge, and I never encountered a good rule of thumb here. submitted by /u/Blutorangensaft [link] [comments]  ( 9 min )
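    As a concrete reference point for the question, here is a minimal sketch of what the two schemes do over one "epoch" of example indices. The dataset size and variable names are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10
indices = np.arange(n)

# Without replacement: one epoch is a shuffle, every example appears exactly once.
epoch_without = rng.permutation(indices)

# With replacement: n independent draws, so some examples repeat and others are missed.
epoch_with = rng.choice(indices, size=n, replace=True)

print(sorted(epoch_without.tolist()))
print(len(set(epoch_with.tolist())), "distinct examples out of", n)
```

    For large n, an epoch drawn with replacement is expected to touch only about 1 - 1/e ≈ 63% of the distinct examples, which is the main practical difference between the two schemes.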
    [D] How to build this pipeline
    Hey everyone! Right now I'm working on a project to finetune Speech to Text models with real-world data. Basically, users record audio on the frontend and it gets sent to the backend. The backend stores these audio recordings as audio files on a cloud bucket. The second step is transcribing these audio recordings. This is done by our contractors, who use a UI we built that retrieves untranscribed audio recordings from the bucket, lets the contractor listen to the audio and write a transcript, and then submits the transcript to a backend, which stores the transcript along with the id of the audio recording in a SQL DB. Now we want to train/finetune ASR models (mostly Whisper) on these labeled audio recordings. My question is, how do I design and implement a data pipeline that gets the data from the cloud bucket and the SQL DB (which has the transcripts), aggregates it, and makes it ready for the ASR training/finetuning? I have heard of Apache Airflow for building data pipelines but I've never used it. Will this be the right tool for the job? Can you please provide details/best practices/tool recommendations on how to build such a pipeline? What I'm thinking about is using a tool to create parquet files that have two main columns: audio (floating point array of audio data) and transcript (text column that has the transcription for the audio), plus some other columns for metadata. Note: We're using a small cloud provider that is not AWS, Azure or GCP, so please recommend open source tools submitted by /u/Amgadoz [link] [comments]  ( 9 min )
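    The aggregation step described above can be sketched with the standard library alone. Everything specific here is an assumption for illustration: the `transcripts` table and its columns, the `audio_uris` dictionary standing in for a bucket listing, and the `build_manifest` helper are hypothetical, not part of the poster's system or of any tool.

```python
import sqlite3


def build_manifest(conn, audio_uris):
    """Join transcripts from the SQL DB with audio files found in the bucket."""
    rows = conn.execute("SELECT audio_id, transcript FROM transcripts").fetchall()
    manifest = []
    for audio_id, transcript in rows:
        uri = audio_uris.get(audio_id)
        if uri is None or not transcript.strip():
            continue  # skip rows with no audio file or an empty transcript
        manifest.append(
            {"audio_id": audio_id, "audio_uri": uri, "transcript": transcript}
        )
    return manifest


# In-memory stand-ins for the real SQL DB and the bucket listing.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transcripts (audio_id TEXT PRIMARY KEY, transcript TEXT)")
conn.executemany(
    "INSERT INTO transcripts VALUES (?, ?)",
    [("a1", "hello world"), ("a2", "  "), ("a3", "good morning")],
)
audio_uris = {"a1": "bucket/audio/a1.wav", "a2": "bucket/audio/a2.wav"}

for record in build_manifest(conn, audio_uris):
    print(record)
```

    A manifest of URIs plus transcripts is one possible intermediate format; the decoded audio arrays could be added when the records are written out to parquet, and a scheduler such as Airflow would only need to call a function like this on a timer.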
    [D] XGBoost with LIME/SHAP
    I built an XGBoost model using Python that uses 10 features to make its predictions. Some of those features are categorical and are encoded using one-hot encoding. The model performs well, but I'm having trouble exploring feature weighting for a specific row instance. After encoding I go from 10 features to 5000+, as expected. The problem is I can't figure out a way to show the feature value influences on a single row instance for only the 10 original features. Is this possible, or do I need to use another form of encoding? submitted by /u/DataDojoDude [link] [comments]  ( 9 min )
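    One workaround people use for this, sketched here under assumptions, is to compute per-column attributions on the encoded matrix and then sum the columns that came from the same original feature; this relies on the attributions being additive, as SHAP values are. The column names, the mapping, and the attribution numbers below are all made up, and `aggregate_attributions` is a hypothetical helper, not a function from the shap or lime packages.

```python
import numpy as np

# Hypothetical encoded column names and the original feature each came from.
encoded_columns = ["age", "city_NY", "city_LA", "city_SF", "plan_basic", "plan_pro"]
source_feature = {
    "age": "age",
    "city_NY": "city", "city_LA": "city", "city_SF": "city",
    "plan_basic": "plan", "plan_pro": "plan",
}


def aggregate_attributions(row_attributions, columns, mapping):
    """Sum per-column attributions for one row back onto the original features."""
    totals = {}
    for value, column in zip(row_attributions, columns):
        feature = mapping[column]
        totals[feature] = totals.get(feature, 0.0) + float(value)
    return totals


# Made-up attribution values for one row (one number per encoded column).
row = np.array([0.30, 0.10, -0.05, 0.00, -0.20, 0.05])
print(aggregate_attributions(row, encoded_columns, source_feature))
```

    The per-feature totals still add up to the same overall attribution for the row, so the single-row explanation shrinks from thousands of columns to the original ten without changing what it sums to.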
    [D]How to fine tune LLMs using deepspeed without OOM issues
    I've been trying to fine tune the Llama 2 13B model (not quantized) on an AWS g5.12x instance, which has 4 × 24 GB A10G GPUs and 192 GB of RAM. I'm also using PEFT LoRA for fine tuning. I've been trying to fine-tune it with the Hugging Face Trainer along with DeepSpeed stage 3, because it can offload the parameters to the CPU, but I run into out of memory errors irrespective of the batch size or my sequence length. In the DeepSpeed configuration file I have set both the optimizer offload and the parameter offload to CPU as well. Any ideas on where I could be going wrong? Or is the model just too big for my machine even with DeepSpeed? submitted by /u/IXMachina [link] [comments]  ( 9 min )
  • Open

    Normal approximation to normal
    In my previous post on approximating a logistic distribution with a normal distribution I accidentally said something about approximating a normal with a normal. Obviously the best approximation to a probability distribution is itself. As Norbert Wiener said “The best material model of a cat is another, or preferably the same, cat.” But this made […] Normal approximation to normal first appeared on John D. Cook.  ( 5 min )
    Logistic / Normal approximation
    In a recent post I pointed out that a soliton, a solution to the KdV equation, looks a lot like a normal density for fixed x. As someone pointed out in the comments, one way to look at this is that the soliton is exactly proportional to the density of a logistic distribution, and it’s […] Logistic / Normal approximation first appeared on John D. Cook.  ( 6 min )
  • Open

    China's AI Analog Chip Claimed to Be 3000X Faster Than Nvidia's A100 GPU
    China's ACCEL chip, developed by Tsinghua University, is claimed to be 3000 times faster than Nvidia's A100 GPU and to have 4 million times higher energy efficiency. The chip leverages photonic and analog computing in a specialized architecture, delivering over 3000 times the performance of the Nvidia A100 at an energy consumption that's four million times lower. ACCEL can perform 4.6 trillion operations per second in vision tasks, which is a significant improvement compared to Nvidia's A100. The chip has shown high accuracy levels in various computer vision applications, including Fashion-MNIST, 3-class ImageNet classification, and time-lapse video recognition tasks. ACCEL operates through diffractive optical analog computing (OAC) and electronic analog computing (EAC), with 99% of its operation implemented within the optical system. The photonic, optical system of ACCEL reduces energy requirements and waste heat, resulting in higher energy efficiency compared to digital systems like Nvidia's GPU. The chip's low computing latency and high throughput make it suitable for real-time applications. ACCEL is considered an analog rendition of an Application-Specific Integrated Circuit (ASIC) design, with the electronic analog computing (EAC) unit reconfiguring analog pathways to accelerate specific tasks. The development of ACCEL represents a significant achievement in computing architecture for the AI era, with potential practical applications in various fields. Source : https://www.tomshardware.com/tech-industry/semiconductors/chinas-accel-analog-chip-promises-to-outpace-industry-best-in-ai-acceleration-for-vision-tasks submitted by /u/NuseAI [link] [comments]
    Elon Musk is getting ready to launch his first AI model to premium X users. 'Grok' will be 'based' and 'loves sarcasm,' Musk said.
    submitted by /u/thisisinsider [link] [comments]
    Firms like Meta and A16z admit having to pay billions for training data would ruin their generative-AI plans as they fight new copyright rules
    submitted by /u/geekteam6 [link] [comments]
    What's Under the Hood of Adobe Firefly?
    Hey everyone, I've been exploring the capabilities of Adobe Firefly, and I'm quite intrigued by its functionalities. I understand it's Adobe's AI framework designed for creative tasks, but I'm curious about the specific technologies and models it employs. Does anyone have insights into the type of models or algorithms that power Adobe Firefly? Is it using something similar to GANs, CNNs, or perhaps a different kind of neural network architecture? Also, how does it compare to other image-based AI models like DALL-E or CLIP in terms of its image processing and generation capabilities? Would love to dive deeper into the technical details if anyone's got the scoop! submitted by /u/cheapnessltd [link] [comments]
    Agi will be the end of humans
    Ai is the end for humanity. We’ll probably evolve into some cyborg hybrid integrated with computer chips or our bodies will be preserved like in the matrix while the ai cyborgs harvest our consciousness to exist. Sure right now companies can cut costs but eventually there’s no need for most companies out there if no one works. Money will become worthless too since you don’t need it to survive and will probably hunt or farm for food. The more I think about it, it seems like the human species has some suicidal death wish programmed into our brains or, more likely, we are competing with another intelligence that’s using us as means to an end of its evolution. submitted by /u/YSLFAHLIFE [link] [comments]
    "Understanding the Potential Dangers of AI Humanoids: Insights from Elon Musk"
    submitted by /u/Fit-Code-5141 [link] [comments]
    How To Outsource AI Content Creation 3x Cheaper With Freelancers
    hello readers Not so long ago I finished writing my article about How To Outsource AI Content Creation 3x Cheaper With Freelancers. I was wondering what real fans and admirers of AI topics think about it, I really want you to read my article and give some fair feedback about it. submitted by /u/PerceptionPlayful469 [link] [comments]
  • Open

    Guide for MARLLib
    The documentation for MARLlib is pretty lacklustre, does anyone know of any tutorial on how to make custom environments work with it, and also example code for handling training loops, etc? I'm mostly having issues with understanding what the ''make()'' function does from marllib. submitted by /u/EquivalentCurious745 [link] [comments]
    Custom Boid Flocking environment (Open AI Gym)
    Background: Boids are described at https://en.wikipedia.org/wiki/Boids. I successfully implemented Reynolds' flocking model (results below), but my OpenAI Gym implementation doesn't work. Objective: build a custom OpenAI Gym Boid flocking environment for RL, trained with the Stable Baselines3 PPO algorithm. Error: see Error.txt in the linked files. What I have tried: checking initializations and debugging NaN values. Honestly, I have no idea what to do. I am an amateur with about 2 months of OpenAI Gym experience, so please be gentle. Results (Reynolds' model): flocking with 20 agents. The RL code is in Env.py and the error is in Error.txt. The flocking code using Reynolds' model is in Agent.py, and it works perfectly. https://preview.redd.it/wlsasy6r5dyb1.png?width=1569&format=png&auto=webp&s=59661ae34273b459b2cf193775c1861851f6747e Link to files and error: https://drive.google.com/drive/folders/1RhsVen6CQNh0b1PWqT7FbTggYKDKEqsF?usp=sharing submitted by /u/Sadboi1010 [link] [comments]
    A beginner friendly introduction to Deep RL discussing four of the greatest seminal works
    Hey people, sharing a video from my ML YT channel where I discuss what RL is all about and discuss four great papers from 2010s that personally got me into the subject during my grad study days… its kinda beginner friendly in tone, but more appropriately it highlights the strengths and challenges, and the key algorithmic ideas in the field! submitted by /u/AvvYaa [link] [comments]
    Visual observations for centralised critic in MADDPG
    Hello everyone, I'm currently working on a project involving a multi-agent systems problem that I intend to solve using MARL. A common good practice in MARL is to use a centralized critic for all agents, which takes the observations of all agents as input. When the agents rely on RGB images, there are many ways to feed the observations to the centralized critic. My question is: what is the best way to do so (should I just concatenate the images along the channel dimension, generate features for each image and concatenate afterwards, etc.)? Thank you in advance. submitted by /u/Many_Reception_4921 [link] [comments]
    RL courses online
    What courses are really useful to understand the theory of RL? Stanford? Berkeley? submitted by /u/MomoSolar [link] [comments]
    Q-LP
    Are there any sources for the Linear Programming solution to the Q-function? submitted by /u/MomoSolar [link] [comments]
    RL books - references
    Are there references that explain algorithms in modern RL like TRPO, PPO, A3C, in a clear way, and possibly implements them? submitted by /u/MomoSolar [link] [comments]
    DDPG tutorial
    Is there a site that explains DDPG theory and has an implementation for it in a very clear way? submitted by /u/MomoSolar [link] [comments]
    Policy Gradient for Continuous actions
    What if my action space is continuous, but is within a certain (interval) range? submitted by /u/MomoSolar [link] [comments]
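    One common way to handle a bounded continuous action space can be sketched as follows: sample an unbounded Gaussian value and squash it into the interval with tanh. This is only a minimal illustration; the function name `sample_bounded_action` and its parameters are hypothetical, and a real policy-gradient implementation would also need to correct the log-probability for the squashing transform.

```python
import math
import random


def sample_bounded_action(mean, log_std, low, high, rng):
    """Sample a Gaussian value and squash it into [low, high] with tanh.

    `mean` and `log_std` stand in for the outputs of a policy network
    (hypothetical here); `rng` is a random.Random instance.
    """
    u = rng.gauss(mean, math.exp(log_std))  # unbounded pre-squash sample
    squashed = math.tanh(u)                 # now in [-1, 1]
    # Affine map from [-1, 1] to [low, high].
    return low + 0.5 * (squashed + 1.0) * (high - low)


rng = random.Random(0)
samples = [sample_bounded_action(0.0, 0.5, -2.0, 3.0, rng) for _ in range(1000)]
```

    Every sample stays inside the interval by construction, so no clipping is needed after sampling.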
    Q-learning for demand response
    In the paper (in the first comment), optimal electricity prices are determined via Q-learning. The MDP includes energy demand as states and electricity prices as actions. The reward is a weighted sum of service provider profit and customer satisfaction. In this case, state t=2 need not be dependent on state t=1 and action t = 1. I do not understand how Q(st, at) can then represent the discounted sum of expected rewards, especially that st+1 may not follow from an action taken at st. Is the modelling of the MDP valid? https://pdf.sciencedirectassets.com/271429/1-s2.0-S0306261918X00099/1-s2.0-S0306261918304112/main.pdf?X-Amz-Security-Token=IQoJb3JpZ2luX2VjELr%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FwEaCXVzLWVhc3QtMSJHMEUCIGsBT5yR8u2kFHHVNsJMX4FAkc%2FB%2BuT0Elulb6gCmnntAiEAi2OBcPqvWhhGAQvYRKCCkc6dBRB4…
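    The situation described above can be explored with a small tabular sketch. It assumes, purely for illustration, that the next demand state is drawn uniformly at random regardless of the current state and price action; the reward table and all names are hypothetical and are not taken from the paper. Under that assumption the discounted future term is the same for every action, so the greedy action in each state is simply the one with the highest immediate reward.

```python
import random


def q_learning_exogenous(reward, n_states, n_actions, steps=20000,
                         alpha=0.1, gamma=0.9, seed=0):
    """Tabular Q-learning where the next state ignores the action taken."""
    rng = random.Random(seed)
    q = [[0.0] * n_actions for _ in range(n_states)]
    s = rng.randrange(n_states)
    for _ in range(steps):
        a = rng.randrange(n_actions)      # uniform exploration
        s_next = rng.randrange(n_states)  # exogenous: independent of s and a
        target = reward[s][a] + gamma * max(q[s_next])
        q[s][a] += alpha * (target - q[s][a])
        s = s_next
    return q


# Hypothetical reward table: rows are demand states, columns are price actions.
reward = [[1.0, 0.2, 0.0],
          [0.1, 0.9, 0.3]]
q = q_learning_exogenous(reward, n_states=2, n_actions=3)
greedy = [max(range(3), key=lambda a: row[a]) for row in q]
```

    The Q-values are still well defined as expected discounted returns in this setting; the dependence on the action just enters through the immediate reward only.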
    AI MarketPlace to buy and sell ML models
    Hi, I'm working on creating an AI marketplace where developers can upload models, and startups and enterprises can deploy and run them in the cloud at scale. Any feedback would be greatly appreciated! We are currently onboarding developers and waitlisting buyers. Here is our interest form: https://forms.gle/X4Wy7NyMcWULddEBA submitted by /u/Dismal-Call2668 [link] [comments]
  • Open

    Your Data-to-Value Journey Starts with AI and Data Literacy
    79.8% of organizations cite cultural barriers to data adoption, yet AI and data literacy rank at only 1.6% in the CDO’s list of priorities. I find the research from New Vantage Partners, headed by industry legends Tom Davenport and Randy Bean, incredibly valuable. Their annual “Data and Analytics Leadership Annual Executive Survey” series delivers invaluable… The post Your Data-to-Value Journey Starts with AI and Data Literacy appeared first on Data Science Central.  ( 23 min )
  • Open

    How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model. (arXiv:2305.00586v5 [cs.CL] UPDATED)
    Pre-trained language models can be surprisingly adept at tasks they were not explicitly trained on, but how they implement these capabilities is poorly understood. In this paper, we investigate the basic mathematical abilities often acquired by pre-trained language models. Concretely, we use mechanistic interpretability techniques to explain the (limited) mathematical abilities of GPT-2 small. As a case study, we examine its ability to take in sentences such as "The war lasted from the year 1732 to the year 17", and predict valid two-digit end years (years > 32). We first identify a circuit, a small subset of GPT-2 small's computational graph that computes this task's output. Then, we explain the role of each circuit component, showing that GPT-2 small's final multi-layer perceptrons boost the probability of end years greater than the start year. Finally, we find related tasks that activate our circuit. Our results suggest that GPT-2 small computes greater-than using a complex but general mechanism that activates across diverse contexts.
    Generalized Bayesian Inference for Scientific Simulators via Amortized Cost Estimation. (arXiv:2305.15208v2 [stat.ML] UPDATED)
    Simulation-based inference (SBI) enables amortized Bayesian inference for simulators with implicit likelihoods. But when we are primarily interested in the quality of predictive simulations, or when the model cannot exactly reproduce the observed data (i.e., is misspecified), targeting the Bayesian posterior may be overly restrictive. Generalized Bayesian Inference (GBI) aims to robustify inference for (misspecified) simulator models, replacing the likelihood-function with a cost function that evaluates the goodness of parameters relative to data. However, GBI methods generally require running multiple simulations to estimate the cost function at each parameter value during inference, making the approach computationally infeasible for even moderately complex simulators. Here, we propose amortized cost estimation (ACE) for GBI to address this challenge: We train a neural network to approximate the cost function, which we define as the expected distance between simulations produced by a parameter and observed data. The trained network can then be used with MCMC to infer GBI posteriors for any observation without running additional simulations. We show that, on several benchmark tasks, ACE accurately predicts cost and provides predictive simulations that are closer to synthetic observations than other SBI methods, especially for misspecified simulators. Finally, we apply ACE to infer parameters of the Hodgkin-Huxley model given real intracellular recordings from the Allen Cell Types Database. ACE identifies better data-matching parameters while being an order of magnitude more simulation-efficient than a standard SBI method. In summary, ACE combines the strengths of SBI methods and GBI to perform robust and simulation-amortized inference for scientific simulators.
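    The cost function described in this abstract, the expected distance between simulations and the observed data, can be sketched as a plain Monte Carlo average. The toy Gaussian simulator, the squared-error distance, and all names below are assumptions made for illustration; they are not the authors' code. The sketch shows why evaluating the cost at each parameter value requires many simulator runs, which is the expense a trained network would amortize.

```python
import random


def estimate_cost(simulator, theta, x_obs, distance, n_sims=5000, seed=0):
    """Monte Carlo estimate of E[distance(simulator(theta), x_obs)]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_sims):
        total += distance(simulator(theta, rng), x_obs)
    return total / n_sims


def toy_simulator(theta, rng):
    # Hypothetical simulator: one noisy draw centred on the parameter.
    return rng.gauss(theta, 1.0)


def squared_error(x, x_obs):
    return (x - x_obs) ** 2


x_obs = 2.0
cost_near = estimate_cost(toy_simulator, 2.0, x_obs, squared_error)
cost_far = estimate_cost(toy_simulator, 5.0, x_obs, squared_error)
```

    For this toy setup the exact cost is (theta - x_obs)^2 + 1, so the estimate near the observation should be close to 1 and the estimate far from it close to 10.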
    Bayesian Design Principles for Frequentist Sequential Learning. (arXiv:2310.00806v2 [cs.LG] UPDATED)
    We develop a general theory to optimize the frequentist regret for sequential learning problems, where efficient bandit and reinforcement learning algorithms can be derived from unified Bayesian principles. We propose a novel optimization approach to generate "algorithmic beliefs" at each round, and use Bayesian posteriors to make decisions. The optimization objective to create "algorithmic beliefs," which we term "Algorithmic Information Ratio," represents an intrinsic complexity measure that effectively characterizes the frequentist regret of any algorithm. To the best of our knowledge, this is the first systematic approach to make Bayesian-type algorithms prior-free and applicable to adversarial settings, in a generic and optimal manner. Moreover, the algorithms are simple and often efficient to implement. As a major application, we present a novel algorithm for multi-armed bandits that achieves the "best-of-all-worlds" empirical performance in the stochastic, adversarial, and non-stationary environments. We also illustrate how these principles can be used in linear bandits, bandit convex optimization, and reinforcement learning.
    Exclusive Group Lasso for Structured Variable Selection. (arXiv:2108.10284v2 [cs.LG] UPDATED)
    A structured variable selection problem is considered in which the covariates, divided into predefined groups, activate according to sparse patterns with few nonzero entries per group. Capitalizing on the concept of atomic norm, a composite norm can be properly designed to promote such exclusive group sparsity patterns. The resulting norm lends itself to efficient and flexible regularized optimization algorithms for support recovery, like the proximal algorithm. Moreover, an active set algorithm is proposed that builds the solution by successively including structure atoms into the estimated support. It is also shown that such an algorithm can be tailored to match more rigid structures than plain exclusive group sparsity. Asymptotic consistency analysis (with both the number of parameters as well as the number of groups growing with the observation size) establishes the effectiveness of the proposed solution in terms of signed support recovery under conventional assumptions. Finally, a set of numerical simulations further corroborates the results.
    Deep Neural Networks for Automatic Speaker Recognition Do Not Learn Supra-Segmental Temporal Features. (arXiv:2311.00489v2 [cs.SD] UPDATED)
    While deep neural networks have shown impressive results in automatic speaker recognition and related tasks, it is dissatisfactory how little is understood about what exactly is responsible for these results. Part of the success has been attributed in prior work to their capability to model supra-segmental temporal information (SST), i.e., learn rhythmic-prosodic characteristics of speech in addition to spectral features. In this paper, we (i) present and apply a novel test to quantify to what extent the performance of state-of-the-art neural networks for speaker recognition can be explained by modeling SST; and (ii) present several means to force respective nets to focus more on SST and evaluate their merits. We find that a variety of CNN- and RNN-based neural network architectures for speaker recognition do not model SST to any sufficient degree, even when forced. The results provide a highly relevant basis for impactful future research into better exploitation of the full speech signal and give insights into the inner workings of such networks, enhancing explainability of deep learning for speech technologies.
    Extremal Domain Translation with Neural Optimal Transport. (arXiv:2301.12874v3 [cs.LG] UPDATED)
    In many unpaired image domain translation problems, e.g., style transfer or super-resolution, it is important to keep the translated image similar to its respective input image. We propose the extremal transport (ET) which is a mathematical formalization of the theoretically best possible unpaired translation between a pair of domains w.r.t. the given similarity function. Inspired by the recent advances in neural optimal transport (OT), we propose a scalable algorithm to approximate ET maps as a limit of partial OT maps. We test our algorithm on toy examples and on the unpaired image-to-image translation task. The code is publicly available at https://github.com/milenagazdieva/ExtremalNeuralOptimalTransport
    Private Graph Extraction via Feature Explanations. (arXiv:2206.14724v2 [cs.LG] UPDATED)
    Privacy and interpretability are two important ingredients for achieving trustworthy machine learning. We study the interplay of these two aspects in graph machine learning through graph reconstruction attacks. The goal of the adversary here is to reconstruct the graph structure of the training data given access to model explanations. Based on the different kinds of auxiliary information available to the adversary, we propose several graph reconstruction attacks. We show that additional knowledge of post-hoc feature explanations substantially increases the success rate of these attacks. Further, we investigate in detail the differences between attack performance with respect to three different classes of explanation methods for graph neural networks: gradient-based, perturbation-based, and surrogate model-based methods. While gradient-based explanations reveal the most in terms of the graph structure, we find that these explanations do not always score high in utility. For the other two classes of explanations, privacy leakage increases with an increase in explanation utility. Finally, we propose a defense based on a randomized response mechanism for releasing the explanations, which substantially reduces the attack success rate. Our code is available at https://github.com/iyempissy/graph-stealing-attacks-with-explanation
    Fedstellar: A Platform for Decentralized Federated Learning. (arXiv:2306.09750v2 [cs.LG] UPDATED)
    In 2016, Google proposed Federated Learning (FL) as a novel paradigm to train Machine Learning (ML) models across the participants of a federation while preserving data privacy. Since its birth, Centralized FL (CFL) has been the most used approach, where a central entity aggregates participants' models to create a global one. However, CFL presents limitations such as communication bottlenecks, single point of failure, and reliance on a central server. Decentralized Federated Learning (DFL) addresses these issues by enabling decentralized model aggregation and minimizing dependency on a central entity. Despite these advances, current platforms training DFL models struggle with key issues such as managing heterogeneous federation network topologies. To overcome these challenges, this paper presents Fedstellar, a novel platform designed to train FL models in a decentralized, semi-decentralized, and centralized fashion across diverse federations of physical or virtualized devices. The Fedstellar implementation encompasses a web application with an interactive graphical interface, a controller for deploying federations of nodes using physical or virtual devices, and a core deployed on each device which provides the logic needed to train, aggregate, and communicate in the network. The effectiveness of the platform has been demonstrated in two scenarios: a physical deployment involving single-board devices such as Raspberry Pis for detecting cyberattacks, and a virtualized deployment comparing various FL approaches in a controlled environment using MNIST and CIFAR-10 datasets. In both scenarios, Fedstellar demonstrated consistent performance and adaptability, achieving F1 scores of 91%, 98%, and 91.2% using DFL for detecting cyberattacks and classifying MNIST and CIFAR-10, respectively, reducing training time by 32% compared to centralized approaches.
    Castor: Causal Temporal Regime Structure Learning. (arXiv:2311.01412v1 [cs.LG])
    The task of uncovering causal relationships among multivariate time series data stands as an essential and challenging objective that cuts across a broad array of disciplines ranging from climate science to healthcare. Such data entail linear or non-linear relationships, and usually follow multiple a priori unknown regimes. Existing causal discovery methods can infer summary causal graphs from heterogeneous data with known regimes, but they fall short in comprehensively learning both regimes and the corresponding causal graph. In this paper, we introduce CASTOR, a novel framework designed to learn causal relationships in heterogeneous time series data composed of various regimes, each governed by a distinct causal graph. Through the maximization of a score function via the EM algorithm, CASTOR infers the number of regimes and learns linear or non-linear causal relationships in each regime. We demonstrate the robust convergence properties of CASTOR, specifically highlighting its proficiency in accurately identifying unique regimes. Empirical evidence, garnered from exhaustive synthetic experiments and two real-world benchmarks, confirms CASTOR's superior performance in causal discovery compared to baseline methods. By learning a full temporal causal graph for each regime, CASTOR establishes itself as a distinctly interpretable method for causal discovery in heterogeneous time series.
    Lossy Image Compression with Conditional Diffusion Models. (arXiv:2209.06950v6 [eess.IV] UPDATED)
    This paper outlines an end-to-end optimized lossy image compression framework using diffusion generative models. The approach relies on the transform coding paradigm, where an image is mapped into a latent space for entropy coding and, from there, mapped back to the data space for reconstruction. In contrast to VAE-based neural compression, where the (mean) decoder is a deterministic neural network, our decoder is a conditional diffusion model. Our approach thus introduces an additional "content" latent variable on which the reverse diffusion process is conditioned and uses this variable to store information about the image. The remaining "texture" variables characterizing the diffusion process are synthesized at decoding time. We show that the model's performance can be tuned toward perceptual metrics of interest. Our extensive experiments involving multiple datasets and image quality assessment metrics show that our approach yields stronger reported FID scores than the GAN-based model, while also yielding competitive performance with VAE-based models in several distortion metrics. Furthermore, training the diffusion with X-parameterization enables high-quality reconstructions in only a handful of decoding steps, greatly affecting the model's practicality.
    SRN-SZ: Deep Leaning-Based Scientific Error-bounded Lossy Compression with Super-resolution Neural Networks. (arXiv:2309.04037v2 [cs.LG] UPDATED)
    The fast growth of computational power and scales of modern super-computing systems have raised great challenges for the management of exascale scientific data. To maintain the usability of scientific data, error-bounded lossy compression has been proposed and developed as an essential technique for the size reduction of scientific data with constrained data distortion. Among the diverse datasets generated by various scientific simulations, certain datasets cannot be effectively compressed by existing error-bounded lossy compressors with traditional techniques. The recent success of Artificial Intelligence has inspired several researchers to integrate neural networks into error-bounded lossy compressors. However, those works still suffer from limited compression ratios and/or extremely low efficiencies. To address those issues and improve the compression on the hard-to-compress datasets, in this paper, we propose SRN-SZ, which is a deep learning-based scientific error-bounded lossy compressor leveraging the hierarchical data grid expansion paradigm implemented by super-resolution neural networks. SRN-SZ applies the most advanced super-resolution network HAT for its compression, which avoids time-consuming per-data training. In experiments compared with various state-of-the-art compressors, SRN-SZ achieves compression ratio improvements of up to 75% under the same error bound and up to 80% under the same PSNR over the second-best compressor.
    TRIALSCOPE: A Unifying Causal Framework for Scaling Real-World Evidence Generation with Biomedical Language Models. (arXiv:2311.01301v1 [cs.LG])
    The rapid digitization of real-world data offers an unprecedented opportunity for optimizing healthcare delivery and accelerating biomedical discovery. In practice, however, such data is most abundantly available in unstructured forms, such as clinical notes in electronic medical records (EMRs), and it is generally plagued by confounders. In this paper, we present TRIALSCOPE, a unifying framework for distilling real-world evidence from population-level observational data. TRIALSCOPE leverages biomedical language models to structure clinical text at scale, employs advanced probabilistic modeling for denoising and imputation, and incorporates state-of-the-art causal inference techniques to combat common confounders. Using clinical trial specification as a generic representation, TRIALSCOPE provides a turn-key solution to generate and reason with clinical hypotheses using observational data. In extensive experiments and analyses on a large-scale real-world dataset with over one million cancer patients from a large US healthcare network, we show that TRIALSCOPE can produce high-quality structuring of real-world data and generate results comparable to marquee cancer trials. In addition to facilitating in silico clinical trial design and optimization, TRIALSCOPE may be used to empower synthetic controls, pragmatic trials, and post-market surveillance, as well as to support fine-grained patient-like-me reasoning in precision diagnosis and treatment.
    The Blessing of Randomness: SDE Beats ODE in General Diffusion-based Image Editing. (arXiv:2311.01410v1 [cs.CV])
    We present a unified probabilistic formulation for diffusion-based image editing, where a latent variable is edited in a task-specific manner and generally deviates from the corresponding marginal distribution induced by the original stochastic or ordinary differential equation (SDE or ODE). Instead, it defines a corresponding SDE or ODE for editing. In the formulation, we prove that the Kullback-Leibler divergence between the marginal distributions of the two SDEs gradually decreases while that for the ODEs remains as the time approaches zero, which shows the promise of SDE in image editing. Inspired by it, we provide the SDE counterparts for widely used ODE baselines in various tasks including inpainting and image-to-image translation, where SDE shows a consistent and substantial improvement. Moreover, we propose SDE-Drag -- a simple yet effective method built upon the SDE formulation for point-based content dragging. We build a challenging benchmark (termed DragBench) with open-set natural, art, and AI-generated images for evaluation. A user study on DragBench indicates that SDE-Drag significantly outperforms our ODE baseline, existing diffusion-based methods, and the renowned DragGAN. Our results demonstrate the superiority and versatility of SDE in image editing and push the boundary of diffusion-based editing methods.
    QUIK: Towards End-to-End 4-Bit Inference on Generative Large Language Models. (arXiv:2310.09259v2 [cs.LG] UPDATED)
    Large Language Models (LLMs) from the GPT family have become extremely popular, leading to a race towards reducing their inference costs to allow for efficient local computation. Yet, the vast majority of existing work focuses on weight-only quantization, which can reduce runtime costs in the memory-bound one-token-at-a-time generative setting, but does not address them in compute-bound scenarios, such as batched inference or prompt processing. In this paper, we address the general quantization problem, where both weights and activations should be quantized. We show, for the first time, that the majority of inference computations for large generative models such as LLaMA, OPT, and Falcon can be performed with both weights and activations being cast to 4 bits, in a way that leads to practical speedups, while at the same time maintaining good accuracy. We achieve this via a hybrid quantization strategy called QUIK, which compresses most of the weights and activations to 4-bit, while keeping some outlier weights and activations in higher-precision. The key feature of our scheme is that it is designed with computational efficiency in mind: we provide GPU kernels matching the QUIK format with highly-efficient layer-wise runtimes, which lead to practical end-to-end throughput improvements of up to 3.4x relative to FP16 execution. We provide detailed studies for models from the OPT, LLaMA-2 and Falcon families, as well as a first instance of accurate inference using quantization plus 2:4 sparsity. Code is available at: https://github.com/IST-DASLab/QUIK.
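    The general idea of keeping a few outliers in higher precision while casting everything else to 4 bits can be illustrated with a toy sketch. This is not the QUIK format or its GPU kernels: the magnitude-based outlier selection, the single symmetric scale, and every name below are assumptions chosen only to show why outliers matter for low-bit quantization.

```python
def quantize_4bit_with_outliers(values, n_outliers):
    """Symmetric 4-bit quantization that keeps the largest values unquantized."""
    order = sorted(range(len(values)), key=lambda i: abs(values[i]), reverse=True)
    outlier_idx = set(order[:n_outliers])
    base = [v for i, v in enumerate(values) if i not in outlier_idx]
    max_abs = max((abs(v) for v in base), default=0.0)
    scale = max_abs / 7 if max_abs > 0 else 1.0  # signed 4-bit range is [-8, 7]
    codes = {}
    for i, v in enumerate(values):
        if i not in outlier_idx:
            codes[i] = max(-8, min(7, round(v / scale)))
    outliers = {i: values[i] for i in outlier_idx}  # kept in full precision
    return codes, scale, outliers


def dequantize(codes, scale, outliers, n):
    return [outliers[i] if i in outliers else codes[i] * scale for i in range(n)]


values = [0.1, -0.3, 0.7, 25.0, -0.5]
codes, scale, outliers = quantize_4bit_with_outliers(values, n_outliers=1)
restored = dequantize(codes, scale, outliers, len(values))

# Without outlier handling, one large value dominates the scale.
codes_naive, scale_naive, _ = quantize_4bit_with_outliers(values, n_outliers=0)
```

    In the naive case the scale is set by the single large entry, so the small entries all collapse to the zero code; separating the outlier lets the remaining values use the full 4-bit range.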
    Description and Discussion on DCASE 2023 Challenge Task 2: First-Shot Unsupervised Anomalous Sound Detection for Machine Condition Monitoring. (arXiv:2305.07828v2 [cs.SD] UPDATED)
    We present the task description of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2023 Challenge Task 2: "First-shot unsupervised anomalous sound detection (ASD) for machine condition monitoring". The main goal is to enable rapid deployment of ASD systems for new kinds of machines without the need for hyperparameter tuning. In past ASD tasks, the developed methods tuned hyperparameters for each machine type, as the development and evaluation datasets had the same machine types. However, collecting normal and anomalous data as the development dataset can be infeasible in practice. In 2023 Task 2, we focus on solving the first-shot problem, which is the challenge of training a model on a completely novel machine type. Specifically, (i) each machine type has only one section (a subset of machine type) and (ii) machine types in the development and evaluation datasets are completely different. Analysis of 86 submissions from 23 teams revealed that the keys to outperforming the baselines were: 1) sampling techniques for dealing with class imbalances across different domains and attributes, 2) generation of synthetic samples for robust detection, and 3) use of multiple large pre-trained models to extract meaningful embeddings for the anomaly detector.
    MiliPoint: A Point Cloud Dataset for mmWave Radar. (arXiv:2309.13425v2 [cs.LG] UPDATED)
    Millimetre-wave (mmWave) radar has emerged as an attractive and cost-effective alternative for human activity sensing compared to traditional camera-based systems. mmWave radars are also non-intrusive, providing better protection for user privacy. However, as a Radio Frequency (RF) based technology, mmWave radars rely on capturing reflected signals from objects, making them more prone to noise compared to cameras. This raises an intriguing question for the deep learning community: Can we develop more effective point set-based deep learning methods for such attractive sensors? To answer this question, our work, termed MiliPoint, delves into this idea by providing a large-scale, open dataset for the community to explore how mmWave radars can be utilised for human activity recognition. Moreover, MiliPoint stands out as it is larger in size than existing datasets, has more diverse human actions represented, and encompasses all three key tasks in human activity recognition. We have also established a range of point-based deep neural networks such as DGCNN, PointNet++ and PointTransformer, on MiliPoint, which can serve to set the ground baseline for further development.
    Bridging Machine Learning and Sciences: Opportunities and Challenges. (arXiv:2210.13441v2 [stat.ML] UPDATED)
    The application of machine learning in sciences has seen exciting advances in recent years. As a widely applicable technique, anomaly detection has been long studied in the machine learning community. Especially, deep neural nets-based out-of-distribution detection has made great progress for high-dimensional data. Recently, these techniques have been showing their potential in scientific disciplines. We take a critical look at their applicative prospects including data universality, experimental protocols, model robustness, etc. We discuss examples that display transferable practices and domain-specific challenges simultaneously, providing a starting point for establishing a novel interdisciplinary research paradigm in the near future.
    Anonymous Learning via Look-Alike Clustering: A Precise Analysis of Model Generalization. (arXiv:2310.04015v3 [cs.LG] UPDATED)
    While personalized recommendation systems have become increasingly popular, ensuring user data protection remains a top concern in the development of these learning systems. A common approach to enhancing privacy involves training models using anonymous data rather than individual data. In this paper, we explore a natural technique called look-alike clustering, which involves replacing sensitive features of individuals with the cluster's average values. We provide a precise analysis of how training models using anonymous cluster centers affects their generalization capabilities. We focus on an asymptotic regime where the size of the training set grows in proportion to the feature dimension. Our analysis is based on the Convex Gaussian Minimax Theorem (CGMT) and allows us to theoretically understand the role of different model components on the generalization error. In addition, we demonstrate that in certain high-dimensional regimes, training over anonymous cluster centers acts as a regularization and improves the generalization error of the trained models. Finally, we corroborate our asymptotic theory with finite-sample numerical experiments where we observe a perfect match when the sample size is only on the order of a few hundred.
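    The replacement step described in this abstract can be sketched in a few lines: each individual's sensitive features are swapped for the average over that individual's cluster. The function name and the assumption that cluster assignments are already given are illustrative only; how the clusters are formed is not covered here.

```python
from collections import defaultdict


def look_alike(sensitive, cluster_ids):
    """Replace each row of sensitive features with its cluster's mean."""
    sums = {}
    counts = defaultdict(int)
    for x, c in zip(sensitive, cluster_ids):
        if c not in sums:
            sums[c] = [0.0] * len(x)
        for j, v in enumerate(x):
            sums[c][j] += v
        counts[c] += 1
    centers = {c: [v / counts[c] for v in s] for c, s in sums.items()}
    # Every member of a cluster receives the same anonymized values.
    return [list(centers[c]) for c in cluster_ids]


sensitive = [[1.0, 2.0], [3.0, 4.0], [10.0, 0.0]]
cluster_ids = [0, 0, 1]  # hypothetical, precomputed assignments
anonymized = look_alike(sensitive, cluster_ids)
```

    Note that a cluster with a single member keeps its original values, so in practice the cluster sizes matter for the privacy side of the trade-off.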
    Ranking with Popularity Bias: User Welfare under Self-Amplification Dynamics. (arXiv:2305.18333v2 [cs.IR] UPDATED)
    While popularity bias is recognized to play a crucial role in recommender (and other ranking-based) systems, detailed analysis of its impact on collective user welfare has largely been lacking. We propose and theoretically analyze a general mechanism, rooted in many of the models proposed in the literature, by which item popularity, item quality, and position bias jointly impact user choice. We focus on a standard setting in which user utility is largely driven by item quality, and a recommender attempts to estimate it given user behavior. Formulating the problem as a non-stationary contextual bandit, we study the ability of a recommender policy to maximize user welfare under this model. We highlight the importance of exploration, not to eliminate popularity bias, but to mitigate its negative impact on welfare. We first show that naive popularity-biased recommenders induce linear regret by conflating item quality and popularity. More generally, we show that, even in linear settings, identifiability of item quality may not be possible due to the confounding effects of popularity bias. However, under sufficient variability assumptions, we develop an efficient optimistic algorithm and prove efficient regret guarantees w.r.t. user welfare. We complement our analysis with several simulation studies, which demonstrate the negative impact of popularity bias on the performance of several natural recommender policies.
    Combining Optimal Path Search With Task-Dependent Learning in a Neural Network. (arXiv:2201.11104v6 [cs.LG] UPDATED)
    Finding optimal paths in connected graphs requires determining the smallest total cost for traveling along the graph's edges. This problem can be solved by several classical algorithms where, usually, costs are predefined for all edges. Conventional planning methods can, thus, normally not be used when wanting to change costs in an adaptive way following the requirements of some task. Here we show that one can define a neural network representation of path finding problems by transforming cost values into synaptic weights, which allows for online weight adaptation using network learning mechanisms. When starting with an initial activity value of one, activity propagation in this network will lead to solutions, which are identical to those found by the Bellman-Ford algorithm. The neural network has the same algorithmic complexity as Bellman-Ford and, in addition, we can show that network learning mechanisms (such as Hebbian learning) can adapt the weights in the network augmenting the resulting paths according to some task at hand. We demonstrate this by learning to navigate in an environment with obstacles as well as by learning to follow certain sequences of path nodes. Hence, the here-presented novel algorithm may open up a different regime of applications where path-augmentation (by learning) is directly coupled with path finding in a natural way.
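    As a rough illustration of the equivalence claimed in this abstract, the sketch below compares textbook Bellman-Ford with a max-product activity propagation. The cost-to-weight mapping `w = exp(-cost)` is an assumption chosen for illustration; the abstract does not state the transformation the authors use, and the function names are hypothetical.

```python
import math

def bellman_ford(n, edges, source):
    """Classical Bellman-Ford: edges is a list of (u, v, cost) tuples."""
    dist = [math.inf] * n
    dist[source] = 0.0
    for _ in range(n - 1):
        changed = False
        for u, v, cost in edges:
            if dist[u] + cost < dist[v]:
                dist[v] = dist[u] + cost
                changed = True
        if not changed:
            break
    return dist

def activity_propagation(n, edges, source):
    """Max-product propagation starting from an activity of one at the source.

    Assumption: each cost is mapped to a weight exp(-cost), so the largest
    activity reaching a node corresponds to the smallest total path cost.
    """
    weights = [(u, v, math.exp(-cost)) for u, v, cost in edges]
    activity = [0.0] * n
    activity[source] = 1.0
    for _ in range(n - 1):
        for u, v, w in weights:
            activity[v] = max(activity[v], activity[u] * w)
    return activity
```

    Under this mapping, `-log(activity[v])` recovers the shortest-path cost to every reachable node `v`, and changing a weight changes the resulting path, which is the hook that learning rules would use.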
    Fine-grained Expressivity of Graph Neural Networks. (arXiv:2306.03698v2 [cs.LG] UPDATED)
    Numerous recent works have analyzed the expressive power of message-passing graph neural networks (MPNNs), primarily utilizing combinatorial techniques such as the $1$-dimensional Weisfeiler-Leman test ($1$-WL) for the graph isomorphism problem. However, the graph isomorphism objective is inherently binary, not giving insights into the degree of similarity between two given graphs. This work resolves this issue by considering continuous extensions of both $1$-WL and MPNNs to graphons. Concretely, we show that the continuous variant of $1$-WL delivers an accurate topological characterization of the expressive power of MPNNs on graphons, revealing which graphs these networks can distinguish and the level of difficulty in separating them. We identify the finest topology where MPNNs separate points and prove a universal approximation theorem. Consequently, we provide a theoretical framework for graph and graphon similarity combining various topological variants of classical characterizations of the $1$-WL. In particular, we characterize the expressive power of MPNNs in terms of the tree distance, which is a graph distance based on the concept of fractional isomorphisms, and substructure counts via tree homomorphisms, showing that these concepts have the same expressive power as the $1$-WL and MPNNs on graphons. Empirically, we validate our theoretical findings by showing that randomly initialized MPNNs, without training, exhibit competitive performance compared to their trained counterparts. Moreover, we evaluate different MPNN architectures based on their ability to preserve graph distances, highlighting the significance of our continuous $1$-WL test in understanding MPNNs' expressivity.
    A Simple Solution for Offline Imitation from Observations and Examples with Possibly Incomplete Trajectories. (arXiv:2311.01329v1 [cs.LG])
    Offline imitation from observations aims to solve MDPs where only task-specific expert states and task-agnostic non-expert state-action pairs are available. Offline imitation is useful in real-world scenarios where arbitrary interactions are costly and expert actions are unavailable. The state-of-the-art "DIstribution Correction Estimation" (DICE) methods minimize divergence of state occupancy between expert and learner policies and retrieve a policy with weighted behavior cloning; however, their results are unstable when learning from incomplete trajectories, due to a non-robust optimization in the dual domain. To address the issue, in this paper, we propose Trajectory-Aware Imitation Learning from Observations (TAILO). TAILO uses a discounted sum along the future trajectory as the weight for weighted behavior cloning. The terms for the sum are scaled by the output of a discriminator, which aims to identify expert states. Despite simplicity, TAILO works well if there exist trajectories or segments of expert behavior in the task-agnostic data, a common assumption in prior work. In experiments across multiple testbeds, we find TAILO to be more robust and effective, particularly with incomplete trajectories.
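    The weighting idea in this abstract, a discounted sum over the future trajectory, can be sketched as follows. The per-state `scores` stand in for the scaled discriminator outputs and are hypothetical inputs; the exact scaling used by TAILO is not given in the abstract.

```python
def discounted_future_weights(scores, gamma=0.98):
    """Return w[t] = sum over k >= 0 of gamma**k * scores[t + k] for one trajectory.

    scores : per-state values, assumed to come from a discriminator that
             rates how expert-like each state is (hypothetical input)
    gamma  : discount factor (the default here is an arbitrary choice)
    """
    weights = [0.0] * len(scores)
    running = 0.0
    # Walk backwards so each weight reuses the sum already computed for t + 1.
    for t in range(len(scores) - 1, -1, -1):
        running = scores[t] + gamma * running
        weights[t] = running
    return weights
```

    These weights would then multiply the per-sample loss in weighted behavior cloning, so states that lead into expert-like segments count more.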
    Model-free Policy Learning with Reward Gradients. (arXiv:2103.05147v4 [cs.LG] UPDATED)
    Despite the increasing popularity of policy gradient methods, they are yet to be widely utilized in sample-scarce applications, such as robotics. The sample efficiency could be improved by making best usage of available information. As a key component in reinforcement learning, the reward function is usually devised carefully to guide the agent. Hence, the reward function is usually known, allowing access to not only scalar reward signals but also reward gradients. To benefit from reward gradients, previous works require the knowledge of environment dynamics, which are hard to obtain. In this work, we develop the \textit{Reward Policy Gradient} estimator, a novel approach that integrates reward gradients without learning a model. Bypassing the model dynamics allows our estimator to achieve a better bias-variance trade-off, which results in a higher sample efficiency, as shown in the empirical analysis. Our method also boosts the performance of Proximal Policy Optimization on different MuJoCo control tasks.
    On Learning Gaussian Multi-index Models with Gradient Flow. (arXiv:2310.19793v2 [stat.ML] UPDATED)
    We study gradient flow on the multi-index regression problem for high-dimensional Gaussian data. Multi-index functions consist of a composition of an unknown low-rank linear projection and an arbitrary unknown, low-dimensional link function. As such, they constitute a natural template for feature learning in neural networks. We consider a two-timescale algorithm, whereby the low-dimensional link function is learnt with a non-parametric model infinitely faster than the subspace parametrizing the low-rank projection. By appropriately exploiting the matrix semigroup structure arising over the subspace correlation matrices, we establish global convergence of the resulting Grassmannian population gradient flow dynamics, and provide a quantitative description of its associated `saddle-to-saddle' dynamics. Notably, the timescales associated with each saddle can be explicitly characterized in terms of an appropriate Hermite decomposition of the target link function. In contrast with these positive results, we also show that the related \emph{planted} problem, where the link function is known and fixed, in fact has a rough optimization landscape, in which gradient flow dynamics might get trapped with high probability.
    Adv3D: Generating Safety-Critical 3D Objects through Closed-Loop Simulation. (arXiv:2311.01446v1 [cs.RO])
    Self-driving vehicles (SDVs) must be rigorously tested on a wide range of scenarios to ensure safe deployment. The industry typically relies on closed-loop simulation to evaluate how the SDV interacts on a corpus of synthetic and real scenarios and verify it performs properly. However, they primarily only test the system's motion planning module, and only consider behavior variations. It is key to evaluate the full autonomy system in closed-loop, and to understand how variations in sensor data based on scene appearance, such as the shape of actors, affect system performance. In this paper, we propose a framework, Adv3D, that takes real world scenarios and performs closed-loop sensor simulation to evaluate autonomy performance, and finds vehicle shapes that make the scenario more challenging, resulting in autonomy failures and uncomfortable SDV maneuvers. Unlike prior works that add contrived adversarial shapes to vehicle roof-tops or roadside to harm perception only, we optimize a low-dimensional shape representation to modify the vehicle shape itself in a realistic manner to degrade autonomy performance (e.g., perception, prediction, and motion planning). Moreover, we find that the shape variations found with Adv3D optimized in closed-loop are much more effective than those in open-loop, demonstrating the importance of finding scene appearance variations that affect autonomy in the interactive setting.
    Computable Phenotypes of Patient Acuity in the Intensive Care Unit. (arXiv:2005.05163v2 [q-bio.QM] UPDATED)
    Continuous monitoring and patient acuity assessments are key aspects of Intensive Care Unit (ICU) practice, but both are limited by time constraints imposed on healthcare providers. Moreover, anticipating clinical trajectories remains imprecise. The objectives of this study are to (1) develop an electronic phenotype of acuity using automated variable retrieval within the electronic health records and (2) describe transitions between acuity states that illustrate the clinical trajectories of ICU patients. We gathered two single-center, longitudinal electronic health record datasets for 51,372 adult ICU patients admitted to the University of Florida Health (UFH) Gainesville (GNV) and Jacksonville (JAX). We developed algorithms to quantify acuity status at four-hour intervals for each ICU admission and identify acuity phenotypes using continuous acuity status and k-means clustering approach. 51,073 admissions for 38,749 patients in the UFH GNV dataset and 22,219 admissions for 12,623 patients in the UFH JAX dataset had at least one ICU stay lasting more than four hours. There were three phenotypes: persistently stable, persistently unstable, and transitioning from unstable to stable. For stable patients, approximately 0.7%-1.7% would transition to unstable, 0.02%-0.1% would expire, 1.2%-3.4% would be discharged, and the remaining 96%-97% would remain stable in the ICU every four hours. For unstable patients, approximately 6%-10% would transition to stable, 0.4%-0.5% would expire, and the remaining 89%-93% would remain unstable in the ICU in the next four hours. We developed phenotyping algorithms for patient acuity status every four hours while admitted to the ICU. This approach may be useful in developing prognostic and clinical decision-support tools to aid patients, caregivers, and providers in shared decision-making processes regarding escalation of care and patient values.
    EVBattery: A Large-Scale Electric Vehicle Dataset for Battery Health and Capacity Estimation. (arXiv:2201.12358v3 [cs.LG] UPDATED)
    Electric vehicles (EVs) play an important role in reducing carbon emissions. As EV adoption accelerates, safety issues caused by EV batteries have become an important research topic. In order to benchmark and develop data-driven methods for this task, we introduce a large and comprehensive dataset of EV batteries. Our dataset includes charging records collected from hundreds of EVs from three manufacturers over several years. Our dataset is the first large-scale public dataset on real-world battery data, as existing data either include only several vehicles or is collected in the lab environment. Meanwhile, our dataset features two types of labels, corresponding to two key tasks - battery health estimation and battery capacity estimation. In addition to demonstrating how existing deep learning algorithms can be applied to this task, we further develop an algorithm that exploits the data structure of battery systems. Our algorithm achieves better results and shows that a customized method can improve model performances. We hope that this public dataset provides valuable resources for researchers, policymakers, and industry professionals to better understand the dynamics of EV battery aging and support the transition toward a sustainable transportation system.
    High-dimensional Linear Bandits with Knapsacks. (arXiv:2311.01327v1 [cs.LG])
    We study the contextual bandits with knapsack (CBwK) problem under the high-dimensional setting where the dimension of the feature is large. The reward of pulling each arm equals the multiplication of a sparse high-dimensional weight vector and the feature of the current arrival, with additional random noise. In this paper, we investigate how to exploit this sparsity structure to achieve improved regret for the CBwK problem. To this end, we first develop an online variant of the hard thresholding algorithm that performs the sparse estimation in an online manner. We further combine our online estimator with a primal-dual framework, where we assign a dual variable to each knapsack constraint and utilize an online learning algorithm to update the dual variable, thereby controlling the consumption of the knapsack capacity. We show that this integrated approach allows us to achieve a sublinear regret that depends logarithmically on the feature dimension, thus improving the polynomial dependency established in the previous literature. We also apply our framework to the high-dimension contextual bandit problem without the knapsack constraint and achieve optimal regret in both the data-poor regime and the data-rich regime. We finally conduct numerical experiments to show the efficient empirical performance of our algorithms under the high dimensional setting.
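    For readers unfamiliar with hard thresholding, the basic operator keeps the largest-magnitude coefficients and zeroes the rest. The sketch below shows only this standard operator, not the online variant developed in the paper; the function name and the sparsity parameter `s` are illustrative.

```python
import numpy as np

def hard_threshold(w, s):
    """Keep the s largest-magnitude entries of w and set the others to zero."""
    w = np.asarray(w, dtype=float)
    out = np.zeros_like(w)
    if s <= 0:
        return out
    # Indices of the s entries with the largest absolute value.
    keep = np.argsort(np.abs(w))[-s:]
    out[keep] = w[keep]
    return out
```

    In an iterative estimator, this projection is typically applied after each gradient-style update so that the running estimate stays `s`-sparse.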
    Collaborative Learning via Prediction Consensus. (arXiv:2305.18497v2 [cs.LG] UPDATED)
    We consider a collaborative learning setting where the goal of each agent is to improve their own model by leveraging the expertise of collaborators, in addition to their own training data. To facilitate the exchange of expertise among agents, we propose a distillation-based method leveraging shared unlabeled auxiliary data, which is pseudo-labeled by the collective. Central to our method is a trust weighting scheme that serves to adaptively weigh the influence of each collaborator on the pseudo-labels until a consensus on how to label the auxiliary data is reached. We demonstrate empirically that our collaboration scheme is able to significantly boost individual models' performance in the target domain from which the auxiliary data is sampled. At the same time, it can provably mitigate the negative impact of bad models on the collective. By design, our method adeptly accommodates heterogeneity in model architectures and substantially reduces communication overhead compared to typical collaborative learning methods.
    Causal Falsification of Digital Twins. (arXiv:2301.07210v4 [stat.ME] UPDATED)
    Digital twins are virtual systems designed to predict how a real-world process will evolve in response to interventions. This modelling paradigm holds substantial promise in many applications, but rigorous procedures for assessing their accuracy are essential for safety-critical settings. We consider how to assess the accuracy of a digital twin using real-world data. We formulate this as a causal inference problem, which leads to a precise definition of what it means for a twin to be "correct" appropriate for many applications. Unfortunately, fundamental results from causal inference mean observational data cannot be used to certify that a twin is correct in this sense unless potentially tenuous assumptions are made, such as that the data are unconfounded. To avoid these assumptions, we propose instead to find situations in which the twin is not correct, and present a general-purpose statistical procedure for doing so. Our approach yields reliable and actionable information about the twin under only the assumption of an i.i.d. dataset of observational trajectories, and remains sound even if the data are confounded. We apply our methodology to a large-scale, real-world case study involving sepsis modelling within the Pulse Physiology Engine, which we assess using the MIMIC-III dataset of ICU patients.
    VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models. (arXiv:2307.05973v2 [cs.RO] UPDATED)
    Large language models (LLMs) are shown to possess a wealth of actionable knowledge that can be extracted for robot manipulation in the form of reasoning and planning. Despite the progress, most still rely on pre-defined motion primitives to carry out the physical interactions with the environment, which remains a major bottleneck. In this work, we aim to synthesize robot trajectories, i.e., a dense sequence of 6-DoF end-effector waypoints, for a large variety of manipulation tasks given an open-set of instructions and an open-set of objects. We achieve this by first observing that LLMs excel at inferring affordances and constraints given a free-form language instruction. More importantly, by leveraging their code-writing capabilities, they can interact with a vision-language model (VLM) to compose 3D value maps to ground the knowledge into the observation space of the agent. The composed value maps are then used in a model-based planning framework to zero-shot synthesize closed-loop robot trajectories with robustness to dynamic perturbations. We further demonstrate how the proposed framework can benefit from online experiences by efficiently learning a dynamics model for scenes that involve contact-rich interactions. We present a large-scale study of the proposed method in both simulated and real-robot environments, showcasing the ability to perform a large variety of everyday manipulation tasks specified in free-form natural language. Videos and code at https://voxposer.github.io
    Targeted Separation and Convergence with Kernel Discrepancies. (arXiv:2209.12835v2 [stat.ML] UPDATED)
    Maximum mean discrepancies (MMDs) like the kernel Stein discrepancy (KSD) have grown central to a wide range of applications, including hypothesis testing, sampler selection, distribution approximation, and variational inference. In each setting, these kernel-based discrepancy measures are required to (i) separate a target P from other probability measures or even (ii) control weak convergence to P. In this article we derive new sufficient and necessary conditions to ensure (i) and (ii). For MMDs on separable metric spaces, we characterize those kernels that separate Bochner embeddable measures and introduce simple conditions for separating all measures with unbounded kernels and for controlling convergence with bounded kernels. We use these results on $\mathbb{R}^d$ to substantially broaden the known conditions for KSD separation and convergence control and to develop the first KSDs known to exactly metrize weak convergence to P. Along the way, we highlight the implications of our results for hypothesis testing, measuring and improving sample quality, and sampling with Stein variational gradient descent.
    (Amplified) Banded Matrix Factorization: A unified approach to private training. (arXiv:2306.08153v2 [cs.LG] UPDATED)
    Matrix factorization (MF) mechanisms for differential privacy (DP) have substantially improved the state-of-the-art in privacy-utility-computation tradeoffs for ML applications in a variety of scenarios, but in both the centralized and federated settings there remain instances where either MF cannot be easily applied, or other algorithms provide better tradeoffs (typically, as $\epsilon$ becomes small). In this work, we show how MF can subsume prior state-of-the-art algorithms in both federated and centralized training settings, across all privacy budgets. The key technique throughout is the construction of MF mechanisms with banded matrices (lower-triangular matrices with at most $\hat{b}$ nonzero bands including the main diagonal). For cross-device federated learning (FL), this enables multiple-participations with a relaxed device participation schema compatible with practical FL infrastructure (as demonstrated by a production deployment). In the centralized setting, we prove that banded matrices enjoy the same privacy amplification results as the ubiquitous DP-SGD algorithm, but can provide strictly better performance in most scenarios -- this lets us always at least match DP-SGD, and often outperform it.
    1D-CapsNet-LSTM: A Deep Learning-Based Model for Multi-Step Stock Index Forecasting. (arXiv:2310.02090v2 [cs.LG] UPDATED)
    Multi-step stock index forecasting is vital in finance for informed decision-making. Current forecasting methods on this task frequently produce unsatisfactory results due to the inherent data randomness and instability, thereby underscoring the demand for advanced forecasting models. Given the superiority of capsule network (CapsNet) over CNN in various forecasting and classification tasks, this study investigates the potential of integrating a 1D CapsNet with an LSTM network for multi-step stock index forecasting. To this end, a hybrid 1D-CapsNet-LSTM model is introduced, which utilizes a 1D CapsNet to generate high-level capsules from sequential data and an LSTM network to capture temporal dependencies. To maintain stochastic dependencies over different forecasting horizons, a multi-input multi-output (MIMO) strategy is employed. The model's performance is evaluated on real-world stock market indices, including S&P 500, DJIA, IXIC, and NYSE, and compared to baseline models, including LSTM, RNN, and CNN-LSTM, using metrics such as RMSE, MAE, MAPE, and TIC. The proposed 1D-CapsNet-LSTM model consistently outperforms baseline models in two key aspects. It exhibits significant reductions in forecasting errors compared to baseline models. Furthermore, it displays a slower rate of error increase with lengthening forecast horizons, indicating increased robustness for multi-step forecasting tasks.
    CADSim: Robust and Scalable in-the-wild 3D Reconstruction for Controllable Sensor Simulation. (arXiv:2311.01447v1 [cs.CV])
    Realistic simulation is key to enabling safe and scalable development of self-driving vehicles. A core component is simulating the sensors so that the entire autonomy system can be tested in simulation. Sensor simulation involves modeling traffic participants, such as vehicles, with high quality appearance and articulated geometry, and rendering them in real time. The self-driving industry has typically employed artists to build these assets. However, this is expensive, slow, and may not reflect reality. Instead, reconstructing assets automatically from sensor data collected in the wild would provide a better path to generating a diverse and large set with good real-world coverage. Nevertheless, current reconstruction approaches struggle on in-the-wild sensor data, due to its sparsity and noise. To tackle these issues, we present CADSim, which combines part-aware object-class priors via a small set of CAD models with differentiable rendering to automatically reconstruct vehicle geometry, including articulated wheels, with high-quality appearance. Our experiments show our method recovers more accurate shapes from sparse data compared to existing approaches. Importantly, it also trains and renders efficiently. We demonstrate our reconstructed vehicles in several applications, including accurate testing of autonomy perception systems.
    Data Summarization beyond Monotonicity: Non-monotone Two-Stage Submodular Maximization. (arXiv:2309.05183v2 [cs.DS] UPDATED)
    The objective of a two-stage submodular maximization problem is to reduce the ground set using provided training functions that are submodular, with the aim of ensuring that optimizing new objective functions over the reduced ground set yields results comparable to those obtained over the original ground set. This problem has applications in various domains including data summarization. Existing studies often assume the monotonicity of the objective function, whereas our work pioneers the extension of this research to accommodate non-monotone submodular functions. We have introduced the first constant-factor approximation algorithms for this more general case.
    Long Story Short: Omitted Variable Bias in Causal Machine Learning. (arXiv:2112.13398v4 [econ.EM] UPDATED)
    We derive general, yet simple, sharp bounds on the size of the omitted variable bias for a broad class of causal parameters that can be identified as linear functionals of the conditional expectation function of the outcome. Such functionals encompass many of the traditional targets of investigation in causal inference studies, such as, for example, (weighted) average of potential outcomes, average treatment effects (including subgroup effects, such as the effect on the treated), (weighted) average derivatives, and policy effects from shifts in covariate distribution -- all for general, nonparametric causal models. Our construction relies on the Riesz-Frechet representation of the target functional. Specifically, we show how the bound on the bias depends only on the additional variation that the latent variables create both in the outcome and in the Riesz representer for the parameter of interest. Moreover, in many important cases (e.g., average treatment effects and average derivatives) the bound is shown to depend on easily interpretable quantities that measure the explanatory power of the omitted variables. Therefore, simple plausibility judgments on the maximum explanatory power of omitted variables (in explaining treatment and outcome variation) are sufficient to place overall bounds on the size of the bias. Furthermore, we use debiased machine learning to provide flexible and efficient statistical inference on learnable components of the bounds. Finally, empirical examples demonstrate the usefulness of the approach.
    ILCAS: Imitation Learning-Based Configuration-Adaptive Streaming for Live Video Analytics with Cross-Camera Collaboration. (arXiv:2308.10068v2 [cs.NI] UPDATED)
    High-accuracy, resource-intensive deep neural networks (DNNs) have been widely adopted by live video analytics (VA), where camera videos are streamed over the network to resource-rich edge/cloud servers for DNN inference. Common video encoding configurations (e.g., resolution and frame rate) have been found to significantly affect the balance between bandwidth consumption and inference accuracy, and therefore their adaptation scheme has been a focus of optimization. However, previous profiling-based solutions suffer from high profiling cost, while existing deep reinforcement learning (DRL) based solutions may achieve poor performance because they train the agent with a fixed reward function, which fails to capture the application goals in various scenarios. In this paper, we propose ILCAS, the first imitation learning (IL) based configuration-adaptive VA streaming system. Unlike DRL-based solutions, ILCAS trains the agent with demonstrations collected from the expert, which is designed as an offline optimal policy that solves the configuration adaptation problem through dynamic programming. To tackle the challenge of video content dynamics, ILCAS derives motion feature maps based on motion vectors which allow ILCAS to visually "perceive" video content changes. Moreover, ILCAS incorporates a cross-camera collaboration scheme to exploit the spatio-temporal correlations of cameras for more appropriate configuration selection. Extensive experiments confirm the superiority of ILCAS compared with state-of-the-art solutions, with 2-20.9% improvement of mean accuracy and 19.9-85.3% reduction of chunk upload lag.
    Like an Open Book? Read Neural Network Architecture with Simple Power Analysis on 32-bit Microcontrollers. (arXiv:2311.01344v1 [cs.CR])
    Model extraction is a growing concern for the security of AI systems. For deep neural network models, the architecture is the most important information an adversary aims to recover. Being a sequence of repeated computation blocks, neural network models deployed on edge devices will generate distinctive side-channel leakages. The latter can be exploited to extract critical information when targeted platforms are physically accessible. By combining theoretical knowledge about deep learning practices and analysis of a widespread implementation library (ARM CMSIS-NN), our purpose is to answer this critical question: how far can we extract architecture information by simply examining an EM side-channel trace? For the first time, we propose an extraction methodology for traditional MLP and CNN models running on a high-end 32-bit microcontroller (Cortex-M7) that relies only on simple pattern recognition analysis. Despite a few challenging cases, we claim that, contrary to parameter extraction, the complexity of the attack is relatively low and we highlight the urgent need for practicable protections that could fit the strong memory and latency requirements of such platforms.
    Add and Thin: Diffusion for Temporal Point Processes. (arXiv:2311.01139v1 [cs.LG])
    Autoregressive neural networks within the temporal point process (TPP) framework have become the standard for modeling continuous-time event data. Even though these models can expressively capture event sequences in a one-step-ahead fashion, they are inherently limited for long-term forecasting applications due to the accumulation of errors caused by their sequential nature. To overcome these limitations, we derive ADD-THIN, a principled probabilistic denoising diffusion model for TPPs that operates on entire event sequences. Unlike existing diffusion approaches, ADD-THIN naturally handles data with discrete and continuous components. In experiments on synthetic and real-world datasets, our model matches the state-of-the-art TPP models in density estimation and strongly outperforms them in forecasting.
    Online Continual Learning Without the Storage Constraint. (arXiv:2305.09253v2 [cs.CV] UPDATED)
    Traditional online continual learning (OCL) research has primarily focused on mitigating catastrophic forgetting with fixed and limited storage allocation throughout an agent's lifetime. However, a broad range of real-world applications are primarily constrained by computational costs rather than storage limitations. In this paper, we target such applications, investigating the online continual learning problem under relaxed storage constraints and limited computational budgets. We contribute a simple algorithm, which updates a kNN classifier continually along with a fixed, pretrained feature extractor. We selected this algorithm due to its exceptional suitability for online continual learning. It can adapt to rapidly changing streams, has zero stability gap, operates within tiny computational budgets, has low storage requirements by only storing features, and has a consistency property: It never forgets previously seen data. These attributes yield significant improvements, allowing our proposed algorithm to outperform existing methods by over 20% in accuracy on two large-scale OCL datasets: Continual LOCalization (CLOC) with 39M images and 712 classes and Continual Google Landmarks V2 (CGLM) with 580K images and 10,788 classes, even when existing methods retain all previously seen images. Furthermore, we achieve this superior performance with considerably reduced computational and storage expenses. We provide code to reproduce our results at github.com/drimpossible/ACM.
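    The approach summarized in this abstract, a kNN classifier updated continually on top of fixed features, can be sketched in a few lines. This is a brute-force illustration under the assumption that features have already been extracted by a frozen model; the class name `ContinualKNN` and its methods are hypothetical and do not reflect the authors' released code.

```python
import numpy as np
from collections import Counter

class ContinualKNN:
    """Brute-force kNN over stored features; nothing is ever discarded."""

    def __init__(self, k=3):
        self.k = k
        self.features = []  # one precomputed feature vector per seen sample
        self.labels = []

    def update(self, feature, label):
        # "Learning" is just appending, so earlier samples are never forgotten.
        self.features.append(np.asarray(feature, dtype=float))
        self.labels.append(label)

    def predict(self, feature):
        if not self.features:
            return None
        bank = np.stack(self.features)
        dists = np.linalg.norm(bank - np.asarray(feature, dtype=float), axis=1)
        nearest = np.argsort(dists)[: self.k]
        votes = Counter(self.labels[i] for i in nearest)
        return votes.most_common(1)[0][0]
```

    A stream would alternate `predict` and `update` calls per incoming sample. At the scale reported in the abstract, an approximate nearest-neighbor index would likely replace the linear scan shown here.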
    Combating Bilateral Edge Noise for Robust Link Prediction. (arXiv:2311.01196v1 [cs.LG])
    Although link prediction on graphs has achieved great success with the development of graph neural networks (GNNs), the potential robustness under the edge noise is still less investigated. To close this gap, we first conduct an empirical study to disclose that the edge noise bilaterally perturbs both input topology and target label, yielding severe performance degradation and representation collapse. To address this dilemma, we propose an information-theory-guided principle, Robust Graph Information Bottleneck (RGIB), to extract reliable supervision signals and avoid representation collapse. Different from the basic information bottleneck, RGIB further decouples and balances the mutual dependence among graph topology, target labels, and representation, building new learning objectives for robust representation against the bilateral noise. Two instantiations, RGIB-SSL and RGIB-REP, are explored to leverage the merits of different methodologies, i.e., self-supervised learning and data reparameterization, for implicit and explicit data denoising, respectively. Extensive experiments on six datasets and three GNNs with diverse noisy scenarios verify the effectiveness of our RGIB instantiations. The code is publicly available at: https://github.com/tmlr-group/RGIB.
    The Behavior and Convergence of Local Bayesian Optimization. (arXiv:2305.15572v2 [cs.LG] UPDATED)
    A recent development in Bayesian optimization is the use of local optimization strategies, which can deliver strong empirical performance on high-dimensional problems compared to traditional global strategies. The "folk wisdom" in the literature is that the focus on local optimization sidesteps the curse of dimensionality; however, little is known concretely about the expected behavior or convergence of Bayesian local optimization routines. We first study the behavior of the local approach, and find that the statistics of individual local solutions of Gaussian process sample paths are surprisingly good compared to what we would expect to recover from global methods. We then present the first rigorous analysis of such a Bayesian local optimization algorithm recently proposed by Müller et al. (2021), and derive convergence rates in both the noisy and noiseless settings.
    Sample-efficient Multi-objective Molecular Optimization with GFlowNets. (arXiv:2302.04040v2 [cs.LG] UPDATED)
    Many crucial scientific problems involve designing novel molecules with desired properties, which can be formulated as a black-box optimization problem over the discrete chemical space. In practice, multiple conflicting objectives and costly evaluations (e.g., wet-lab experiments) make the diversity of candidates paramount. Computational methods have achieved initial success but still struggle with considering diversity in both objective and search space. To fill this gap, we propose a multi-objective Bayesian optimization (MOBO) algorithm leveraging the hypernetwork-based GFlowNets (HN-GFN) as an acquisition function optimizer, with the purpose of sampling a diverse batch of candidate molecular graphs from an approximate Pareto front. Using a single preference-conditioned hypernetwork, HN-GFN learns to explore various trade-offs between objectives. We further propose a hindsight-like off-policy strategy to share high-performing molecules among different preferences in order to speed up learning for HN-GFN. We empirically illustrate that HN-GFN has adequate capacity to generalize over preferences. Moreover, experiments in various real-world MOBO settings demonstrate that our framework predominantly outperforms existing methods in terms of candidate quality and sample efficiency. The code is available at https://github.com/violet-sto/HN-GFN.
    EHRSHOT: An EHR Benchmark for Few-Shot Evaluation of Foundation Models. (arXiv:2307.02028v2 [cs.LG] UPDATED)
    While the general machine learning (ML) community has benefited from public datasets, tasks, and models, the progress of ML in healthcare has been hampered by a lack of such shared assets. The success of foundation models creates new challenges for healthcare ML by requiring access to shared pretrained models to validate performance benefits. We help address these challenges through three contributions. First, we publish a new dataset, EHRSHOT, which contains deidentified structured data from the electronic health records (EHRs) of 6,739 patients from Stanford Medicine. Unlike MIMIC-III/IV and other popular EHR datasets, EHRSHOT is longitudinal and not restricted to ICU/ED patients. Second, we publish the weights of CLMBR-T-base, a 141M parameter clinical foundation model pretrained on the structured EHR data of 2.57M patients. We are one of the first to fully release such a model for coded EHR data; in contrast, most prior models released for clinical data (e.g. GatorTron, ClinicalBERT) only work with unstructured text and cannot process the rich, structured data within an EHR. We provide an end-to-end pipeline for the community to validate and build upon its performance. Third, we define 15 few-shot clinical prediction tasks, enabling evaluation of foundation models on benefits such as sample efficiency and task adaptation. Our model and dataset are available via a research data use agreement from the Stanford AIMI Center. Code to reproduce our results is available at our GitHub repo: https://github.com/som-shahlab/ehrshot-benchmark
    A deep learning experiment for semantic segmentation of overlapping characters in palimpsests. (arXiv:2311.01130v1 [cs.CV])
    Palimpsests refer to historical manuscripts where erased writings have been partially covered by the superimposition of a second writing. By employing imaging techniques, e.g., multispectral imaging, it becomes possible to identify features that are imperceptible to the naked eye, including faded and erased inks. When dealing with overlapping inks, Artificial Intelligence techniques can be utilized to disentangle complex nodes of overlapping letters. In this work, we propose deep learning-based semantic segmentation as a method for identifying and segmenting individual letters in overlapping characters. The experiment was conceived as a proof of concept, focusing on the palimpsests of the Ars Grammatica by Prisciano as a case study. Furthermore, caveats and prospects of our approach combined with multispectral imaging are also discussed.
    AWEQ: Post-Training Quantization with Activation-Weight Equalization for Large Language Models. (arXiv:2311.01305v1 [cs.LG])
    Large language models (LLMs) exhibit excellent performance across a variety of tasks, but they come with significant computational and storage costs. Quantizing these models is an effective way to alleviate this issue. However, existing methods struggle to strike a balance between model accuracy and hardware efficiency. To address this, we introduce AWEQ, a post-training method that requires no additional training overhead. AWEQ excels in both ultra-low-bit quantization and 8-bit weight and activation (W8A8) quantization. It has been observed that weight quantization is less challenging than activation quantization. AWEQ transfers the difficulty of activation quantization to weights using channel equalization, achieving a balance between the quantization difficulties of both, and thereby maximizing performance. We have further refined the equalization method to mitigate quantization bias error, ensuring the robustness of the model. Extensive experiments on popular models such as LLaMA and OPT demonstrate that AWEQ outperforms all existing post-training quantization methods for large models.
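The idea of moving quantization difficulty from activations to weights by per-channel rescaling can be illustrated with a small sketch. The abstract does not give AWEQ's scaling rule, so the square-root balance used below is an assumption chosen for illustration (similar in spirit to other equalization schemes), and the helper names `equalize` and `matmul` are hypothetical. The key property shown is that dividing an activation channel by a scale and multiplying the matching weight row by the same scale leaves the layer output unchanged.

```python
def matmul(A, B):
    """Plain-Python matrix product of A (n x c) and B (c x m)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def equalize(X, W, eps=1e-8):
    """Rescale each input channel so activation and weight ranges are balanced.

    X: n x c activations, W: c x m weights. Returns (X_eq, W_eq, scales).
    The scale sqrt(max|X_j| / max|W_j|) is an illustrative choice, not AWEQ's rule.
    """
    channels = len(W)
    scales = []
    for j in range(channels):
        act_max = max(abs(row[j]) for row in X)
        w_max = max(abs(v) for v in W[j])
        scales.append(((act_max + eps) / (w_max + eps)) ** 0.5)
    X_eq = [[row[j] / scales[j] for j in range(channels)] for row in X]
    W_eq = [[v * scales[j] for v in W[j]] for j in range(channels)]
    return X_eq, W_eq, scales


# Channel 0 has outlier activations and tiny weights; equalization balances them.
X = [[100.0, 1.0], [-50.0, 0.5]]
W = [[0.01, 0.02], [1.0, -2.0]]
X_eq, W_eq, scales = equalize(X, W)
```

After equalization the largest activation in channel 0 shrinks from 100 to roughly 1.4 while the matching weight row grows by the same factor, so both tensors span a comparable range before quantization.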
    Role of Structural and Conformational Diversity for Machine Learning Potentials. (arXiv:2311.00862v1 [physics.chem-ph])
    In the field of Machine Learning Interatomic Potentials (MLIPs), understanding the intricate relationship between data biases, specifically conformational and structural diversity, and model generalization is critical in improving the quality of Quantum Mechanics (QM) data generation efforts. We investigate these dynamics through two distinct experiments: a fixed budget one, where the dataset size remains constant, and a fixed molecular set one, which focuses on fixed structural diversity while varying conformational diversity. Our results reveal nuanced patterns in generalization metrics. Notably, for optimal structural and conformational generalization, a careful balance between structural and conformational diversity is required, but existing QM datasets do not meet that trade-off. Additionally, our results highlight the limitation of the MLIP models at generalizing beyond their training distribution, emphasizing the importance of defining applicability domain during model deployment. These findings provide valuable insights and guidelines for QM data generation efforts.
    Improving Robustness via Tilted Exponential Layer: A Communication-Theoretic Perspective. (arXiv:2311.01047v1 [cs.LG])
    State-of-the-art techniques for enhancing robustness of deep networks mostly rely on empirical risk minimization with suitable data augmentation. In this paper, we propose a complementary approach motivated by communication theory, aimed at enhancing the signal-to-noise ratio at the output of a neural network layer via neural competition during learning and inference. In addition to minimization of a standard end-to-end cost, neurons compete to sparsely represent layer inputs by maximization of a tilted exponential (TEXP) objective function for the layer. TEXP learning can be interpreted as maximum likelihood estimation of matched filters under a Gaussian model for data noise. Inference in a TEXP layer is accomplished by replacing batch norm by a tilted softmax, which can be interpreted as computation of posterior probabilities for the competing signaling hypotheses represented by each neuron. After providing insights via simplified models, we show, by experimentation on standard image datasets, that TEXP learning and inference enhances robustness against noise and other common corruptions, without requiring data augmentation. Further cumulative gains in robustness against this array of distortions can be obtained by appropriately combining TEXP with data augmentation techniques.
    Learning to Design and Use Tools for Robotic Manipulation. (arXiv:2311.00754v1 [cs.RO])
    When limited by their own morphologies, humans and some species of animals have the remarkable ability to use objects from the environment toward accomplishing otherwise impossible tasks. Robots might similarly unlock a range of additional capabilities through tool use. Recent techniques for jointly optimizing morphology and control via deep learning are effective at designing locomotion agents. But while outputting a single morphology makes sense for locomotion, manipulation involves a variety of strategies depending on the task goals at hand. A manipulation agent must be capable of rapidly prototyping specialized tools for different goals. Therefore, we propose learning a designer policy, rather than a single design. A designer policy is conditioned on task information and outputs a tool design that helps solve the task. A design-conditioned controller policy can then perform manipulation using these tools. In this work, we take a step towards this goal by introducing a reinforcement learning framework for jointly learning these policies. Through simulated manipulation tasks, we show that this framework is more sample efficient than prior methods in multi-goal or multi-variant settings, can perform zero-shot interpolation or fine-tuning to tackle previously unseen goals, and allows tradeoffs between the complexity of design and control policies under practical constraints. Finally, we deploy our learned policies onto a real robot. Please see our supplementary video and website at https://robotic-tool-design.github.io/ for visualizations.
    Exploring Deep Learning Techniques for Glaucoma Detection: A Comprehensive Review. (arXiv:2311.01425v1 [eess.IV])
    Glaucoma is one of the primary causes of vision loss around the world, necessitating accurate and efficient detection methods. Traditional manual detection approaches have limitations in terms of cost, time, and subjectivity. Recent developments in deep learning approaches demonstrate potential in automating glaucoma detection by detecting relevant features from retinal fundus images. This article provides a comprehensive overview of cutting-edge deep learning methods used for the segmentation, classification, and detection of glaucoma. By analyzing recent studies, the effectiveness and limitations of these techniques are evaluated, key findings are highlighted, and potential areas for further research are identified. The use of deep learning algorithms may significantly improve the efficacy, usefulness, and accuracy of glaucoma detection. The findings from this research contribute to the ongoing advancements in automated glaucoma detection and have implications for improving patient outcomes and reducing the global burden of glaucoma.
    Deep Transformed Gaussian Processes. (arXiv:2310.18230v2 [cs.LG] UPDATED)
    Transformed Gaussian Processes (TGPs) are stochastic processes specified by transforming samples from the joint distribution of a prior process (typically a GP) using an invertible transformation, increasing the flexibility of the base process. Furthermore, they achieve competitive results compared with Deep Gaussian Processes (DGPs), which are another generalization constructed by a hierarchical concatenation of GPs. In this work, we propose a generalization of TGPs named Deep Transformed Gaussian Processes (DTGPs), which follows the trend of concatenating layers of stochastic processes. More precisely, we obtain a multi-layer model in which each layer is a TGP. This generalization implies an increase in flexibility with respect to both TGPs and DGPs. Exact inference in such a model is intractable. However, we show that one can use variational inference to approximate the required computations, yielding a straightforward extension of the popular DSVI inference algorithm (Salimbeni et al., 2017). The experiments conducted evaluate the proposed novel DTGPs in multiple regression datasets, achieving good scalability and performance.
    AutoMixer for Improved Multivariate Time-Series Forecasting on Business and IT Observability Data. (arXiv:2310.20280v2 [cs.LG] UPDATED)
    The efficiency of business processes relies on business key performance indicators (Biz-KPIs), which can be negatively impacted by IT failures. Business and IT Observability (BizITObs) data fuses both Biz-KPIs and IT event channels together as multivariate time series data. Forecasting Biz-KPIs in advance can enhance efficiency and revenue through proactive corrective measures. However, BizITObs data generally exhibit both useful and noisy inter-channel interactions between Biz-KPIs and IT events that need to be effectively decoupled. This leads to suboptimal forecasting performance when existing multivariate forecasting models are employed. To address this, we introduce AutoMixer, a time-series Foundation Model (FM) approach, grounded on the novel technique of channel-compressed pretrain and finetune workflows. AutoMixer leverages an AutoEncoder for channel-compressed pretraining and integrates it with the advanced TSMixer model for multivariate time series forecasting. This fusion greatly enhances the potency of TSMixer for accurate forecasts and also generalizes well across several downstream tasks. Through detailed experiments and dashboard analytics, we show AutoMixer's capability to consistently improve the forecasting accuracy of Biz-KPIs (by 11-15%), which directly translates to actionable business insights.
    Generalizing Nonlinear ICA Beyond Structural Sparsity. (arXiv:2311.00866v1 [cs.LG])
    Nonlinear independent component analysis (ICA) aims to uncover the true latent sources from their observable nonlinear mixtures. Despite its significance, the identifiability of nonlinear ICA is known to be impossible without additional assumptions. Recent advances have proposed conditions on the connective structure from sources to observed variables, known as Structural Sparsity, to achieve identifiability in an unsupervised manner. However, the sparsity constraint may not hold universally for all sources in practice. Furthermore, the assumptions of bijectivity of the mixing process and independence among all sources, which arise from the setting of ICA, may also be violated in many real-world scenarios. To address these limitations and generalize nonlinear ICA, we propose a set of new identifiability results in the general settings of undercompleteness, partial sparsity and source dependence, and flexible grouping structures. Specifically, we prove identifiability when there are more observed variables than sources (undercomplete), and when certain sparsity and/or source independence assumptions are not met for some changing sources. Moreover, we show that even in cases with flexible grouping structures (e.g., part of the sources can be divided into irreducible independent groups with various sizes), appropriate identifiability results can also be established. Theoretical claims are supported empirically on both synthetic and real-world datasets.
    Pretraining Data Mixtures Enable Narrow Model Selection Capabilities in Transformer Models. (arXiv:2311.00871v1 [cs.LG])
    Transformer models, notably large language models (LLMs), have the remarkable ability to perform in-context learning (ICL) -- to perform new tasks when prompted with unseen input-output examples without any explicit model training. In this work, we study how effectively transformers can bridge between their pretraining data mixture, comprised of multiple distinct task families, to identify and learn new tasks in-context which are both inside and outside the pretraining distribution. Building on previous work, we investigate this question in a controlled setting, where we study transformer models trained on sequences of $(x, f(x))$ pairs rather than natural language. Our empirical results show transformers demonstrate near-optimal unsupervised model selection capabilities, in their ability to first in-context identify different task families and in-context learn within them when the task families are well-represented in their pretraining data. However, when presented with tasks or functions which are out-of-domain of their pretraining data, we demonstrate various failure modes of transformers and degradation of their generalization for even simple extrapolation tasks. Together, our results highlight that the impressive ICL abilities of high-capacity sequence models may be more closely tied to the coverage of their pretraining data mixtures than to inductive biases that create fundamental generalization capabilities.
    COSTAR: Improved Temporal Counterfactual Estimation with Self-Supervised Learning. (arXiv:2311.00886v1 [cs.LG])
    Estimation of temporal counterfactual outcomes from observed history is crucial for decision-making in many domains such as healthcare and e-commerce, particularly when randomized controlled trials (RCTs) suffer from high cost or impracticality. For real-world datasets, modeling time-dependent confounders is challenging due to complex dynamics, long-range dependencies and both past treatments and covariates affecting the future outcomes. In this paper, we introduce COunterfactual Self-supervised TrAnsformeR (COSTAR), a novel approach that integrates self-supervised learning for improved historical representations. The proposed framework combines temporal and feature-wise attention with a component-wise contrastive loss tailored for temporal treatment outcome observations, yielding superior performance in estimation accuracy and generalization to out-of-distribution data compared to existing models, as validated by empirical results on both synthetic and real-world datasets.
    Deep learning-based Edge-aware pre and post-processing methods for JPEG compressed images. (arXiv:2104.04926v2 [eess.IV] UPDATED)
    We propose a learning-based compression scheme that envelops a standard codec between pre- and post-processing deep CNNs. Specifically, we demonstrate improvements over prior approaches utilizing a compression-decompression network by introducing: (a) an edge-aware loss function to prevent the blurring that commonly occurs in prior works and (b) a super-resolution convolutional neural network (CNN) for post-processing along with a corresponding pre-processing network for improved rate-distortion performance in the low rate regime. The algorithm is assessed on a variety of datasets varying from low to high resolution, namely Set 5, Set 7, Classic 5, Set 14, Live 1, Kodak, General 100, CLIC 2019. When compared to JPEG, JPEG2000, BPG, and recent CNN approach, the proposed algorithm contributes significant improvement in PSNR with an approximate gain of 20.75%, 8.47%, 3.22%, 3.23% and 24.59%, 14.46%, 10.14%, 8.57% at low and high bit-rates respectively. Similarly, this improvement in MS-SSIM is approximately 71.43%, 50%, 36.36%, 23.08%, 64.70% and 64.47%, 61.29%, 47.06%, 51.52%, 16.28% at low and high bit-rates respectively. With CLIC 2019 dataset, PSNR is found to be superior with approximately 16.67%, 10.53%, 6.78%, and 24.62%, 17.39%, 14.08% at low and high bit-rates respectively, over JPEG2000, BPG, and recent CNN approach. Similarly, the MS-SSIM is found to be superior with approximately 72%, 45.45%, 39.13%, 18.52%, and 71.43%, 50%, 41.18%, 17.07% at low and high bit-rates respectively, compared to the same approaches. Similar improvements are achieved on the other datasets.
    Robust Data Pruning under Label Noise via Maximizing Re-labeling Accuracy. (arXiv:2311.01002v1 [cs.LG])
    Data pruning, which aims to downsize a large training set into a small informative subset, is crucial for reducing the enormous computational costs of modern deep learning. Though large-scale data collections invariably contain annotation noise and numerous robust learning methods have been developed, data pruning for the noise-robust learning scenario has received little attention. With state-of-the-art Re-labeling methods that self-correct erroneous labels while training, it is challenging to identify which subset induces the most accurate re-labeling of erroneous labels in the entire training set. In this paper, we formalize the problem of data pruning with re-labeling. We first show that the likelihood of a training example being correctly re-labeled is proportional to the prediction confidence of its neighborhood in the subset. Therefore, we propose a novel data pruning algorithm, Prune4Rel, that finds a subset maximizing the total neighborhood confidence of all training examples, thereby maximizing the re-labeling accuracy and generalization performance. Extensive experiments on four real and one synthetic noisy datasets show that Prune4Rel outperforms the baselines with Re-labeling models by up to 9.1% as well as those with a standard model by up to 21.6%.
    Self-Supervised Pre-Training with Contrastive and Masked Autoencoder Methods for Dealing with Small Datasets in Deep Learning for Medical Imaging. (arXiv:2308.06534v4 [cs.CV] UPDATED)
    Deep learning in medical imaging has the potential to minimize the risk of diagnostic errors, reduce radiologist workload, and accelerate diagnosis. Training such deep learning models requires large and accurate datasets, with annotations for all training samples. However, in the medical imaging domain, annotated datasets for specific tasks are often small due to the high complexity of annotations, limited access, or the rarity of diseases. To address this challenge, deep learning models can be pre-trained on large image datasets without annotations using methods from the field of self-supervised learning. After pre-training, small annotated datasets are sufficient to fine-tune the models for a specific task. The most popular self-supervised pre-training approaches in medical imaging are based on contrastive learning. However, recent studies in natural image processing indicate a strong potential for masked autoencoder approaches. Our work compares state-of-the-art contrastive learning methods with the recently introduced masked autoencoder approach "SparK" for convolutional neural networks (CNNs) on medical images. To this end, we pre-train on a large unannotated CT image dataset and fine-tune on several CT classification tasks. Due to the challenge of obtaining sufficient annotated training data in medical imaging, it is of particular interest to evaluate how the self-supervised pre-training methods perform when fine-tuning on small datasets. By experimenting with gradually reducing the training dataset size for fine-tuning, we find that the reduction has different effects depending on the type of pre-training chosen. The SparK pre-training method is more robust to the training dataset size than the contrastive methods. Based on our results, we recommend SparK pre-training for medical imaging tasks with only small annotated datasets.
    Learned Visual Features to Textual Explanations. (arXiv:2309.00733v2 [cs.CV] UPDATED)
    Interpreting the learned features of vision models has posed a longstanding challenge in the field of machine learning. To address this issue, we propose a novel method that leverages the capabilities of large language models (LLMs) to interpret the learned features of pre-trained image classifiers. Our method, called TExplain, tackles this task by training a neural network to establish a connection between the feature space of image classifiers and LLMs. Then, during inference, our approach generates a vast number of sentences to explain the features learned by the classifier for a given image. These sentences are then used to extract the most frequent words, providing a comprehensive understanding of the learned features and patterns within the classifier. Our method, for the first time, utilizes these frequent words corresponding to a visual representation to provide insights into the decision-making process of the independently trained classifier, enabling the detection of spurious correlations, biases, and a deeper comprehension of its behavior. To validate the effectiveness of our approach, we conduct experiments on diverse datasets, including ImageNet-9L and Waterbirds. The results demonstrate the potential of our method to enhance the interpretability and robustness of image classifiers.
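The final step described in this abstract, reducing many generated sentences to their most frequent words, can be sketched as below. This is only an illustration of that aggregation step, not the TExplain pipeline: the function name `most_frequent_words`, the tiny stopword list, and the example sentences are all assumptions, and the sentences would in practice come from the language model conditioned on classifier features.

```python
import re
from collections import Counter

# Illustrative stopword list; a real pipeline would use a fuller one.
STOPWORDS = {"a", "an", "the", "of", "on", "in", "is", "and", "with"}


def most_frequent_words(sentences, top_k=3):
    """Count non-stopword tokens across sentences and return the top_k words."""
    counts = Counter()
    for sentence in sentences:
        for word in re.findall(r"[a-z]+", sentence.lower()):
            if word not in STOPWORDS:
                counts[word] += 1
    return [word for word, _ in counts.most_common(top_k)]


# Hypothetical sentences generated for one image's features.
generated = ["a bird on the water", "the bird is near water", "a small bird"]
top_words = most_frequent_words(generated, top_k=2)
```

If a background word such as "water" ranks highly for a bird classifier, that is the kind of signal the abstract describes as pointing to a spurious correlation.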
    Gaussian Mixture Solvers for Diffusion Models. (arXiv:2311.00941v1 [cs.LG])
    Recently, diffusion models have achieved great success in generative tasks. Sampling from diffusion models is equivalent to solving the reverse diffusion stochastic differential equations (SDEs) or the corresponding probability flow ordinary differential equations (ODEs). In comparison, SDE-based solvers can generate samples of higher quality and are suited for image translation tasks like stroke-based synthesis. During inference, however, existing SDE-based solvers are severely constrained by the efficiency-effectiveness dilemma. Our investigation suggests that this is because the Gaussian assumption in the reverse transition kernel is frequently violated (even in the case of simple mixture data) given a limited number of discretization steps. To overcome this limitation, we introduce a novel class of SDE-based solvers called Gaussian Mixture Solvers (GMS) for diffusion models. Our solver estimates the first three-order moments and optimizes the parameters of a Gaussian mixture transition kernel using generalized methods of moments in each step during sampling. Empirically, our solver outperforms numerous SDE-based solvers in terms of sample quality in image generation and stroke-based synthesis in various diffusion models, which validates the motivation and effectiveness of GMS. Our code is available at https://github.com/Guohanzhong/GMS.
    An energy-based comparative analysis of common approaches to text classification in the Legal domain. (arXiv:2311.01256v1 [cs.CL])
    Most Machine Learning research evaluates the best solutions in terms of performance. However, in the race for the best performing model, many important aspects are often overlooked when, on the contrary, they should be carefully considered. In fact, sometimes the gaps in performance between different approaches are negligible, whereas factors such as production costs, energy consumption, and carbon footprint must be taken into consideration. Large Language Models (LLMs) are extensively adopted to address NLP problems in academia and industry. In this work, we present a detailed quantitative comparison of LLM and traditional approaches (e.g. SVM) on the LexGLUE benchmark, which takes into account both performance (standard indices) and alternative metrics such as timing, power consumption and cost, in a word: the carbon footprint. In our analysis, we considered the prototyping phase (model selection by training-validation-test iterations) and in-production phases separately, since they follow different implementation procedures and also require different resources. The results indicate that very often, the simplest algorithms achieve performance very close to that of large LLMs but with very low power consumption and lower resource demands. These results suggest that companies should include additional evaluations when choosing Machine Learning (ML) solutions.
    Invariant-Feature Subspace Recovery: A New Class of Provable Domain Generalization Algorithms. (arXiv:2311.00966v1 [cs.LG])
    Domain generalization asks for models trained over a set of training environments to generalize well in unseen test environments. Recently, a series of algorithms such as Invariant Risk Minimization (IRM) have been proposed for domain generalization. However, Rosenfeld et al. (2021) show that in a simple linear data model, even if non-convexity issues are ignored, IRM and its extensions cannot generalize to unseen environments with fewer than $d_s+1$ training environments, where $d_s$ is the dimension of the spurious-feature subspace. In this work, we propose Invariant-feature Subspace Recovery (ISR): a new class of algorithms to achieve provable domain generalization across the settings of classification and regression problems. First, in the binary classification setup of Rosenfeld et al. (2021), we show that our first algorithm, ISR-Mean, can identify the subspace spanned by invariant features from the first-order moments of the class-conditional distributions, and achieve provable domain generalization with $d_s+1$ training environments. Our second algorithm, ISR-Cov, further reduces the required number of training environments to $O(1)$ using the information of second-order moments. Notably, unlike IRM, our algorithms bypass non-convexity issues and enjoy global convergence guarantees. Next, we extend ISR-Mean to the more general setting of multi-class classification and propose ISR-Multiclass, which leverages class information and provably recovers the invariant-feature subspace with $\lceil d_s/k\rceil+1$ training environments for $k$-class classification. Finally, for regression problems, we propose ISR-Regression that can identify the invariant-feature subspace with $d_s+1$ training environments. Empirically, we demonstrate the superior performance of our ISRs on synthetic benchmarks. Further, ISR can be used as a post-processing method for feature extractors such as neural nets.
    E3 TTS: Easy End-to-End Diffusion-based Text to Speech. (arXiv:2311.00945v1 [cs.SD])
    We propose Easy End-to-End Diffusion-based Text to Speech, a simple and efficient end-to-end text-to-speech model based on diffusion. E3 TTS directly takes plain text as input and generates an audio waveform through an iterative refinement process. Unlike much prior work, E3 TTS does not rely on any intermediate representations like spectrogram features or alignment information. Instead, E3 TTS models the temporal structure of the waveform through the diffusion process. Without relying on additional conditioning information, E3 TTS could support flexible latent structure within the given audio. This enables E3 TTS to be easily adapted for zero-shot tasks such as editing without any additional training. Experiments show that E3 TTS can generate high-fidelity audio, approaching the performance of a state-of-the-art neural TTS system. Audio samples are available at https://e3tts.github.io.
    A Multi-Agent Reinforcement Learning Framework for Evaluating the U.S. Ending the HIV Epidemic Plan. (arXiv:2311.00855v1 [cs.AI])
    Human immunodeficiency virus (HIV) is a major public health concern in the United States, with about 1.2 million people living with HIV and 35,000 newly infected each year. There are considerable geographical disparities in HIV burden and care access across the U.S. The 2019 Ending the HIV Epidemic (EHE) initiative aims to reduce new infections by 90% by 2030, by improving coverage of diagnoses, treatment, and prevention interventions and prioritizing jurisdictions with high HIV prevalence. Identifying optimal scale-up of intervention combinations will help inform resource allocation. Existing HIV decision analytic models either evaluate specific cities or the overall national population, thus overlooking jurisdictional interactions or differences. In this paper, we propose a multi-agent reinforcement learning (MARL) model that enables jurisdiction-specific decision analyses in an environment with cross-jurisdictional epidemiological interactions. In experimental analyses, conducted on jurisdictions within California and Florida, optimal policies from MARL were significantly different from those generated from single-agent RL, highlighting the influence of jurisdictional variations and interactions. By using comprehensive modeling of HIV and formulations of state space, action space, and reward functions, this work helps demonstrate the strengths and applicability of MARL for informing public health policies, and provides a framework for expanding to the national level to inform the EHE.
    H-NeXt: The next step towards roto-translation invariant networks. (arXiv:2311.01111v1 [cs.CV])
    The widespread popularity of equivariant networks underscores the significance of parameter efficient models and effective use of training data. At a time when robustness to unseen deformations is becoming increasingly important, we present H-NeXt, which bridges the gap between equivariance and invariance. H-NeXt is a parameter-efficient roto-translation invariant network that is trained without a single augmented image in the training set. Our network comprises three components: an equivariant backbone for learning roto-translation independent features, an invariant pooling layer for discarding roto-translation information, and a classification layer. H-NeXt outperforms the state of the art in classification on unaugmented training sets and augmented test sets of MNIST and CIFAR-10.
    Batch Bayesian Optimization for Replicable Experimental Design. (arXiv:2311.01195v1 [cs.LG])
    Many real-world experimental design problems (a) evaluate multiple experimental conditions in parallel and (b) replicate each condition multiple times due to large and heteroscedastic observation noise. Given a fixed total budget, this naturally induces a trade-off between evaluating more unique conditions while replicating each of them fewer times vs. evaluating fewer unique conditions and replicating each more times. Moreover, in these problems, practitioners may be risk-averse and hence prefer an input with both good average performance and small variability. To tackle both challenges, we propose the Batch Thompson Sampling for Replicable Experimental Design (BTS-RED) framework, which encompasses three algorithms. Our BTS-RED-Known and BTS-RED-Unknown algorithms, for, respectively, known and unknown noise variance, choose the number of replications adaptively rather than deterministically such that an input with a larger noise variance is replicated more times. As a result, despite the noise heteroscedasticity, both algorithms enjoy a theoretical guarantee and are asymptotically no-regret. Our Mean-Var-BTS-RED algorithm aims at risk-averse optimization and is also asymptotically no-regret. We also show the effectiveness of our algorithms in two practical real-world applications: precision agriculture and AutoML.
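    The adaptive replication idea in the abstract above can be illustrated with a minimal sketch. It assumes a simple rule that the abstract does not state: replicate an input until the variance of its averaged observation falls below a target. The names `replications`, `fill_batch`, and `target_var` are hypothetical and are not taken from the BTS-RED paper.

```python
import math


def replications(noise_var, target_var):
    """Number of repeats so that the mean of the repeats has variance <= target_var.

    Assumed rule (not from the abstract): n = ceil(noise_var / target_var), at least 1.
    """
    if noise_var < 0 or target_var <= 0:
        raise ValueError("noise_var must be >= 0 and target_var must be > 0")
    return max(1, math.ceil(noise_var / target_var))


def fill_batch(noise_vars, budget, target_var):
    """Take candidate inputs in order until the replication budget is exhausted.

    Returns a list of (candidate_index, n_replications) pairs.
    """
    batch = []
    used = 0
    for index, noise_var in enumerate(noise_vars):
        n = replications(noise_var, target_var)
        if used + n > budget:
            break  # the next candidate no longer fits in the budget
        batch.append((index, n))
        used += n
    return batch
```

    Under this assumed rule, noisier inputs consume more of the fixed budget, so a batch holds fewer unique conditions when noise is high, which is the trade-off the abstract describes.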
    Are These the Same Apple? Comparing Images Based on Object Intrinsics. (arXiv:2311.00750v1 [cs.CV])
    The human visual system can effortlessly recognize an object under different extrinsic factors such as lighting, object poses, and background, yet current computer vision systems often struggle with these variations. An important step to understanding and improving artificial vision systems is to measure image similarity purely based on intrinsic object properties that define object identity. This problem has been studied in the computer vision literature as re-identification, though mostly restricted to specific object categories such as people and cars. We propose to extend it to general object categories, exploring an image similarity metric based on object intrinsics. To benchmark such measurements, we collect the Common paired objects Under differenT Extrinsics (CUTE) dataset of $18,000$ images of $180$ objects under different extrinsic factors such as lighting, poses, and imaging conditions. While existing methods such as LPIPS and CLIP scores do not measure object intrinsics well, we find that combining deep features learned from contrastive self-supervised learning with foreground filtering is a simple yet effective approach to approximating the similarity. We conduct an extensive survey of pre-trained features and foreground extraction methods to arrive at a strong baseline that best measures intrinsic object-centric image similarity among current methods. Finally, we demonstrate that our approach can aid in downstream applications such as acting as an analog for human subjects and improving generalizable re-identification. Please see our project website at https://s-tian.github.io/projects/cute/ for visualizations of the data and demos of our metric.
    Application and Energy-Aware Data Aggregation using Vector Synchronization in Distributed Battery-less IoT Networks. (arXiv:2311.01050v1 [cs.NI])
    The battery-less Internet of Things (IoT) devices are a key element in the sustainable green initiative for the next-generation wireless networks. These battery-free devices use the ambient energy, harvested from the environment. The energy harvesting environment is dynamic and causes intermittent task execution. The harvested energy is stored in small capacitors and it is challenging to assure the application task execution. The main goal is to provide a mechanism to aggregate the sensor data and provide a sustainable application support in the distributed battery-less IoT network. We model the distributed IoT network system consisting of many battery-free IoT sensor hardware modules and heterogeneous IoT applications that are being supported in the device-edge-cloud continuum. The applications require sensor data from a distributed set of battery-less hardware modules and there is provision of joint control over the module actuators. We propose an application-aware task and energy manager (ATEM) for the IoT devices and a vector-synchronization based data aggregator (VSDA). The ATEM is supported by device-level federated energy harvesting and system-level energy-aware heterogeneous application management. In our proposed framework the data aggregator forecasts the available power from the ambient energy harvester using long-short-term-memory (LSTM) model and sets the device profile as well as the application task rates accordingly. Our proposed scheme meets the heterogeneous application requirements with negligible overhead; reduces the data loss and packet delay; increases the hardware component availability; and makes the components available sooner as compared to the state-of-the-art.
    Vision-Language Foundation Models as Effective Robot Imitators. (arXiv:2311.01378v1 [cs.RO])
    Recent progress in vision-language foundation models has shown their ability to understand multimodal data and resolve complicated vision-language tasks, including robotics manipulation. We seek a straightforward way of making use of existing vision-language models (VLMs) with simple fine-tuning on robotics data. To this end, we derive a simple and novel vision-language manipulation framework, dubbed RoboFlamingo, built upon the open-source VLM OpenFlamingo. Unlike prior work, RoboFlamingo utilizes pre-trained VLMs for single-step vision-language comprehension, models sequential history information with an explicit policy head, and is slightly fine-tuned by imitation learning only on language-conditioned manipulation datasets. Such a decomposition provides RoboFlamingo the flexibility for open-loop control and deployment on low-performance platforms. By exceeding the state-of-the-art performance by a large margin on the tested benchmark, we show RoboFlamingo can be an effective and competitive alternative to adapt VLMs to robot control. Our extensive experimental results also reveal several interesting conclusions regarding the behavior of different pre-trained VLMs on manipulation tasks. We believe RoboFlamingo has the potential to be a cost-effective and easy-to-use solution for robotics manipulation, empowering everyone with the ability to fine-tune their own robotics policy.
    Stochastic Smoothed Gradient Descent Ascent for Federated Minimax Optimization. (arXiv:2311.00944v1 [stat.ML])
    In recent years, federated minimax optimization has attracted growing interest due to its extensive applications in various machine learning tasks. While Smoothed Alternative Gradient Descent Ascent (Smoothed-AGDA) has proved successful in centralized nonconvex minimax optimization, whether and how the smoothing technique can help in the federated setting remains unexplored. In this paper, we propose a new algorithm termed Federated Stochastic Smoothed Gradient Descent Ascent (FESS-GDA), which utilizes the smoothing technique for federated minimax optimization. We prove that FESS-GDA can be uniformly used to solve several classes of federated minimax problems and prove new or better analytical convergence results for these settings. We showcase the efficiency of FESS-GDA in practical federated learning tasks of training generative adversarial networks (GANs) and fair classification.
    Conformal Prediction for Time Series with Modern Hopfield Networks. (arXiv:2303.12783v2 [cs.LG] UPDATED)
    To quantify uncertainty, conformal prediction methods are gaining continuously more interest and have already been successfully applied to various domains. However, they are difficult to apply to time series as the autocorrelative structure of time series violates basic assumptions required by conformal prediction. We propose HopCPT, a novel conformal prediction approach for time series that not only copes with temporal structures but leverages them. We show that our approach is theoretically well justified for time series where temporal dependencies are present. In experiments, we demonstrate that our new approach outperforms state-of-the-art conformal prediction methods on multiple real-world time series datasets from four different domains.
    Deep Learning for real-time neural decoding of grasp. (arXiv:2311.01061v1 [cs.LG])
    Neural decoding involves correlating signals acquired from the brain to variables in the physical world like limb movement or robot control in Brain Machine Interfaces. In this context, this work starts from a specific pre-existing dataset of neural recordings from monkey motor cortex and presents a Deep Learning-based approach to the decoding of neural signals for grasp type classification. Specifically, we propose here an approach that exploits LSTM networks to classify time series containing neural data (i.e., spike trains) into classes representing the object being grasped. The main goal of the presented approach is to improve over state-of-the-art decoding accuracy without relying on any prior neuroscience knowledge, and leveraging only the capability of deep learning models to extract correlations from data. The paper presents the results achieved for the considered dataset and compares them with previous works on the same dataset, showing a significant improvement in classification accuracy, even if considering simulated real-time decoding.
    A Study of Continual Learning Under Language Shift. (arXiv:2311.01200v1 [cs.CL])
    The recent increase in data and model scale for language model pre-training has led to huge training costs. In scenarios where new data become available over time, updating a model instead of fully retraining it would therefore provide significant gains. In this paper, we study the benefits and downsides of updating a language model when new data comes from new languages - the case of continual learning under language shift. Starting from a monolingual English language model, we incrementally add data from Norwegian and Icelandic to investigate how forward and backward transfer effects depend on the pre-training order and characteristics of languages, for different model sizes and learning rate schedulers. Our results show that, while forward transfer is largely positive and independent of language order, backward transfer can be either positive or negative depending on the order and characteristics of new languages. To explain these patterns we explore several language similarity metrics and find that syntactic similarity appears to have the best correlation with our results.
    Respiratory Anomaly Detection using Reflected Infrared Light-wave Signals. (arXiv:2311.01367v1 [eess.SP])
    In this study, we present a non-contact respiratory anomaly detection method using incoherent light-wave signals reflected from the chest of a mechanical robot that can breathe like human beings. In comparison to existing radar and camera-based sensing systems for vitals monitoring, this technology uses only a low-cost ubiquitous light source (e.g., infrared light emitting diode) and sensor (e.g., photodetector). This light-wave sensing (LWS) system recognizes different breathing anomalies from the variations of light intensity reflected from the chest of the robot within a 0.5m-1.5m range. The anomaly detection model demonstrates up to 96.6% average accuracy in classifying 7 different types of breathing data using machine learning. The model can also detect faulty data collected by the system that does not contain breathing information. The developed system can be utilized at home or healthcare facilities as a smart, non-contact and discreet respiration monitoring method.
    SIESTA: Efficient Online Continual Learning with Sleep. (arXiv:2303.10725v3 [cs.CV] UPDATED)
    In supervised continual learning, a deep neural network (DNN) is updated with an ever-growing data stream. Unlike the offline setting where data is shuffled, we cannot make any distributional assumptions about the data stream. Ideally, only one pass through the dataset is needed for computational efficiency. However, existing methods are inadequate and make many assumptions that cannot be made for real-world applications, while simultaneously failing to improve computational efficiency. In this paper, we propose a novel continual learning method, SIESTA, based on a wake/sleep framework for training, which is well aligned with the needs of on-device learning. The major goal of SIESTA is to advance compute-efficient continual learning so that DNNs can be updated efficiently using far less time and energy. The principal innovations of SIESTA are: 1) rapid online updates using a rehearsal-free, backpropagation-free, and data-driven network update rule during its wake phase, and 2) expedited memory consolidation using a compute-restricted rehearsal policy during its sleep phase. For memory efficiency, SIESTA adapts latent rehearsal using memory indexing from REMIND. Compared to REMIND and prior art, SIESTA is far more computationally efficient, enabling continual learning on ImageNet-1K in under 2 hours on a single GPU; moreover, in the augmentation-free setting it matches the performance of the offline learner, a milestone critical to driving adoption of continual learning in real-world applications.
    Training Dynamics of Contextual N-Grams in Language Models. (arXiv:2311.00863v1 [cs.LG])
    Prior work has shown the existence of contextual neurons in language models, including a neuron that activates on German text. We show that this neuron exists within a broader contextual n-gram circuit: we find late layer neurons which recognize and continue n-grams common in German text, but which only activate if the German neuron is active. We investigate the formation of this circuit throughout training and find that it is an example of what we call a second-order circuit. In particular, both the constituent n-gram circuits and the German detection circuit which culminates in the German neuron form with independent functions early in training - the German detection circuit partially through modeling German unigram statistics, and the n-grams by boosting appropriate completions. Only after both circuits have already formed do they fit together into a second-order circuit. Contrary to the hypotheses presented in prior work, we find that the contextual n-gram circuit forms gradually rather than in a sudden phase transition. We further present a range of anomalous observations such as a simultaneous phase transition in many tasks coinciding with the learning rate warm-up, and evidence that many context neurons form simultaneously early in training but are later unlearned.
    A Review and Roadmap of Deep Causal Model from Different Causal Structures and Representations. (arXiv:2311.00923v1 [cs.LG])
    The fusion of causal models with deep learning, which introduces increasingly intricate data sets such as the causal associations within images or between textual components, has surfaced as a focal research area. Nonetheless, the broadening of original causal concepts and theories to such complex, non-statistical data has been met with serious challenges. In response, our study proposes redefinitions of causal data into three distinct categories from the standpoint of causal structure and representation: definite data, semi-definite data, and indefinite data. Definite data chiefly pertains to statistical data used in conventional causal scenarios, while semi-definite data refers to a spectrum of data formats germane to deep learning, including time-series, images, text, and others. Indefinite data is an emergent research sphere that we infer from the progression of data forms. To comprehensively present these three data paradigms, we elaborate on their formal definitions, differences manifested in datasets, resolution pathways, and development of research. We summarize key tasks and achievements pertaining to definite and semi-definite data from myriad research undertakings, and present a roadmap for indefinite data, beginning with its current research conundrums. Lastly, we classify and scrutinize the key datasets presently utilized within these three paradigms.
    Monotone Generative Modeling via a Gromov-Monge Embedding. (arXiv:2311.01375v1 [cs.LG])
    Generative Adversarial Networks (GANs) are powerful tools for creating new content, but they face challenges such as sensitivity to starting conditions and mode collapse. To address these issues, we propose a deep generative model that utilizes the Gromov-Monge embedding (GME). It helps identify the low-dimensional structure of the underlying measure of the data and then maps it, while preserving its geometry, into a measure in a low-dimensional latent space, which is then optimally transported to the reference measure. We guarantee the preservation of the underlying geometry by the GME and $c$-cyclical monotonicity of the generative map, where $c$ is an intrinsic embedding cost employed by the GME. The latter property is a first step in guaranteeing better robustness to initialization of parameters and mode collapse. Numerical experiments demonstrate the effectiveness of our approach in generating high-quality images, avoiding mode collapse, and exhibiting robustness to different starting conditions.
    Bounding Wasserstein distance with couplings. (arXiv:2112.03152v3 [stat.CO] UPDATED)
    Markov chain Monte Carlo (MCMC) provides asymptotically consistent estimates of intractable posterior expectations as the number of iterations tends to infinity. However, in large data applications, MCMC can be computationally expensive per iteration. This has catalyzed interest in approximating MCMC in a manner that improves computational speed per iteration but does not produce asymptotically consistent estimates. In this article, we propose estimators based on couplings of Markov chains to assess the quality of such asymptotically biased sampling methods. The estimators give empirical upper bounds of the Wasserstein distance between the limiting distribution of the asymptotically biased sampling method and the original target distribution of interest. We establish theoretical guarantees for our upper bounds and show that our estimators can remain effective in high dimensions. We apply our quality measures to stochastic gradient MCMC, variational Bayes, and Laplace approximations for tall data and to approximate MCMC for Bayesian logistic regression in 4500 dimensions and Bayesian linear regression in 50000 dimensions.
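    The coupling argument behind the abstract above rests on a standard fact: for any coupling (X, Y) with marginals P and Q, the p-Wasserstein distance satisfies W_p(P, Q) <= (E[d(X, Y)^p])^(1/p). The sketch below only illustrates that inequality on paired samples. It is not the paper's estimator, and the function name `coupling_upper_bound` is hypothetical.

```python
import math


def coupling_upper_bound(xs, ys, p=1):
    """Empirical estimate of (E[d(X, Y)^p])^(1/p) from paired draws of a coupling.

    xs[i] and ys[i] must be drawn jointly (that is, from a coupling of the two
    chains), not independently; the distance d is Euclidean here by assumption.
    """
    if len(xs) != len(ys) or not xs:
        raise ValueError("xs and ys must be non-empty and of equal length")
    total = sum(math.dist(x, y) ** p for x, y in zip(xs, ys))
    return (total / len(xs)) ** (1 / p)
```

    A coupling that keeps the two chains close yields a tighter bound, which is why the way the chains are coupled matters in practice.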
    A Review of Digital Twins and their Application in Cybersecurity based on Artificial Intelligence. (arXiv:2311.01154v1 [cs.CR])
    The potential of digital twin technology is yet to be fully realized due to its diversity and untapped potential. Digital twins enable systems' analysis, design, optimization, and evolution to be performed digitally or in conjunction with a cyber-physical approach to improve speed, accuracy, and efficiency over traditional engineering methods. Industry 4.0, factories of the future, and digital twins continue to benefit from the technology and provide enhanced efficiency within existing systems. Due to the lack of information and security standards associated with the transition to cyber digitization, cybercriminals have been able to take advantage of the situation. Access to a digital twin of a product or service is equivalent to threatening the entire collection. There is a robust interaction between digital twins and artificial intelligence tools, which can be used to improve the cybersecurity of these digital platforms through their integration with these technologies. This study aims to investigate the role of artificial intelligence in providing cybersecurity for digital twin versions of various industries, as well as the risks associated with these versions. In addition, this research serves as a road map for researchers and others interested in cybersecurity and digital security.
    Federated Learning on Edge Sensing Devices: A Review. (arXiv:2311.01201v1 [cs.LG])
    The ability to monitor ambient characteristics, interact with them, and derive information about the surroundings has been made possible by the rapid proliferation of edge sensing devices like IoT, mobile, and wearable devices and their measuring capabilities with integrated sensors. Even though these devices are small and have less capacity for data storage and processing, they produce vast amounts of data. Some example application areas where sensor data is collected and processed include healthcare, environmental (including air quality and pollution levels), automotive, industrial, aerospace, and agricultural applications. These enormous volumes of sensing data collected from the edge devices are analyzed using a variety of Machine Learning (ML) and Deep Learning (DL) approaches. However, analyzing them on the cloud or a server presents challenges related to privacy, hardware, and connectivity limitations. Federated Learning (FL) is emerging as a solution to these problems while preserving privacy by jointly training a model without sharing raw data. In this paper, we review the FL strategies from the perspective of edge sensing devices to get over the limitations of conventional machine learning techniques. We focus on the key FL principles, software frameworks, and testbeds. We also explore the current sensor technologies, properties of the sensing devices and sensing applications where FL is utilized. We conclude with a discussion on open issues and future research directions on FL for further studies.
    Diffusion Models for Reinforcement Learning: A Survey. (arXiv:2311.01223v1 [cs.LG])
    Diffusion models have emerged as a prominent class of generative models, surpassing previous methods regarding sample quality and training stability. Recent works have shown the advantages of diffusion models in improving reinforcement learning (RL) solutions, including as trajectory planners, expressive policy classes, data synthesizers, etc. This survey aims to provide an overview of the advancements in this emerging field and hopes to inspire new avenues of research. First, we examine several challenges encountered by current RL algorithms. Then, we present a taxonomy of existing methods based on the roles played by diffusion models in RL and explore how the existing challenges are addressed. We further outline successful applications of diffusion models in various RL-related tasks while discussing the limitations of current approaches. Finally, we conclude the survey and offer insights into future research directions, focusing on enhancing model performance and applying diffusion models to broader tasks. We are actively maintaining a GitHub repository for papers and other related resources in applying diffusion models in RL: https://github.com/apexrl/Diff4RLSurvey .
    Releasing Graph Neural Networks with Differential Privacy Guarantees. (arXiv:2109.08907v2 [cs.LG] UPDATED)
    With the increasing popularity of graph neural networks (GNNs) in several sensitive applications like healthcare and medicine, concerns have been raised over the privacy aspects of trained GNNs. More notably, GNNs are vulnerable to privacy attacks, such as membership inference attacks, even if only black-box access to the trained model is granted. We propose PrivGNN, a privacy-preserving framework for releasing GNN models in a centralized setting. Assuming access to a public unlabeled graph, PrivGNN provides a framework to release GNN models trained explicitly on public data along with knowledge obtained from the private data in a privacy-preserving manner. PrivGNN combines the knowledge-distillation framework with two noise mechanisms, random subsampling and noisy labeling, to ensure rigorous privacy guarantees. We theoretically analyze our approach in the Renyi differential privacy framework. Besides, we show the solid experimental performance of our method compared to several baselines adapted for graph-structured data. Our code is available at https://github.com/iyempissy/privGnn.
    Score-based Data Assimilation for a Two-Layer Quasi-Geostrophic Model. (arXiv:2310.01853v2 [stat.ML] UPDATED)
    Data assimilation addresses the problem of identifying plausible state trajectories of dynamical systems given noisy or incomplete observations. In geosciences, it presents challenges due to the high-dimensionality of geophysical dynamical systems, often exceeding millions of dimensions. This work assesses the scalability of score-based data assimilation (SDA), a novel data assimilation method, in the context of such systems. We propose modifications to the score network architecture aimed at significantly reducing memory consumption and execution time. We demonstrate promising results for a two-layer quasi-geostrophic model.
    An Integrated Framework Integrating Monte Carlo Tree Search and Supervised Learning for Train Timetabling Problem. (arXiv:2311.00971v1 [cs.LG])
    The single-track railway train timetabling problem (TTP) is an important and complex problem. This article proposes an integrated Monte Carlo Tree Search (MCTS) computing framework that combines heuristic methods, unsupervised learning methods, and supervised learning methods for solving TTP in discrete action spaces. This article first describes the mathematical model and simulation system dynamics of TTP, analyzes the characteristics of the solution from the perspective of MCTS, and proposes some heuristic methods to improve MCTS. This article considers these methods as planners in the proposed framework. Secondly, this article utilizes deep convolutional neural networks to approximate the value of nodes and further applies them to the MCTS search process, referred to as learners. The experiments show that the proposed heuristic MCTS method is beneficial for solving TTP; that the algorithm framework that integrates planners and learners can improve the data efficiency of solving TTP; and that the proposed method provides a new paradigm for solving TTP.
    Selectively Sharing Experiences Improves Multi-Agent Reinforcement Learning. (arXiv:2311.00865v1 [cs.LG])
    We present a novel multi-agent RL approach, Selective Multi-Agent Prioritized Experience Relay, in which agents share with other agents a limited number of transitions they observe during training. The intuition behind this is that even a small number of relevant experiences from other agents could help each agent learn. Unlike many other multi-agent RL algorithms, this approach allows for largely decentralized training, requiring only a limited communication channel between agents. We show that our approach outperforms baseline no-sharing decentralized training and state-of-the-art multi-agent RL algorithms. Further, sharing only a small number of highly relevant experiences outperforms sharing all experiences between agents, and the performance uplift from selective experience sharing is robust across a range of hyperparameters and DQN variants. A reference implementation of our algorithm is available at https://github.com/mgerstgrasser/super.
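    The selective sharing described above can be sketched as follows. The sketch assumes that "relevance" is measured by the absolute temporal-difference (TD) error of a transition, which the abstract does not specify. The helper names `select_to_share` and `share` are hypothetical and do not come from the reference implementation.

```python
import heapq


def select_to_share(transitions, td_errors, k):
    """Pick the k transitions with the largest absolute TD error.

    Assumption: |TD error| serves as the relevance score.
    """
    ranked = heapq.nlargest(
        k,
        zip(td_errors, range(len(transitions))),
        key=lambda pair: abs(pair[0]),
    )
    return [transitions[index] for _, index in ranked]


def share(buffers, outgoing):
    """Append each sender's selected transitions to every other agent's buffer.

    buffers:  dict mapping agent name -> list used as a replay buffer
    outgoing: dict mapping sender name -> list of transitions to relay
    """
    for sender, items in outgoing.items():
        for receiver, buffer in buffers.items():
            if receiver != sender:
                buffer.extend(items)
```

    Only the selected transitions cross the communication channel, so the bandwidth needed grows with k rather than with the number of transitions each agent observes.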
    Representing Edge Flows on Graphs via Sparse Cell Complexes. (arXiv:2309.01632v3 [cs.SI] UPDATED)
    Obtaining sparse, interpretable representations of observable data is crucial in many machine learning and signal processing tasks. For data representing flows along the edges of a graph, an intuitively interpretable way to obtain such representations is to lift the graph structure to a simplicial complex: The eigenvectors of the associated Hodge-Laplacian, respectively the incidence matrices of the corresponding simplicial complex then induce a Hodge decomposition, which can be used to represent the observed data in terms of gradient, curl, and harmonic flows. In this paper, we generalize this approach to cellular complexes and introduce the flow representation learning problem, i.e., the problem of augmenting the observed graph by a set of cells, such that the eigenvectors of the associated Hodge Laplacian provide a sparse, interpretable representation of the observed edge flows on the graph. We show that this problem is NP-hard and introduce an efficient approximation algorithm for its solution. Experiments on real-world and synthetic data demonstrate that our algorithm outperforms state-of-the-art methods with respect to approximation error, while being computationally efficient.
    PET Tracer Conversion among Brain PET via Variable Augmented Invertible Network. (arXiv:2311.00735v1 [cs.LG])
    Positron emission tomography (PET), as an imaging technique with high biochemical sensitivity, has been widely used in brain disease diagnosis and brain science research. Since different tracers present different effects on the same focal area, the choice of tracers is getting more significant for PET imaging. Nowadays, with the wide application of PET imaging in neuropsychiatric treatment, 6-18F-fluoro-3, 4-dihydroxy-L-phenylalanine (DOPA) has been found to be more effective than 18F-labeled fluorine-2-deoxyglucose (FDG) in this field. However, due to the complexity of its preparation and other limitations, DOPA is far less widely used than FDG. To address this issue, a tracer conversion invertible neural network (TC-INN) for image projection is developed to map FDG images to DOPA images through deep learning. More diagnostic information is obtained by generating PET images from FDG to DOPA. Specifically, the proposed TC-INN consists of two separate phases, one for training the traceable data, the other for re-building the new data. The reference DOPA PET image is used as the learning target for the corresponding network during the training process of tracer conversion. Meanwhile, the invertible network iteratively estimates the resultant DOPA PET data and compares it to the reference DOPA PET data. Notably, the reversible model employed variable enhancement techniques to achieve better power generation. Moreover, image registration needs to be performed before training due to the angular deviation of the acquired FDG and DOPA data information. Experimental results show generative ability in mapping between FDG images and DOPA images. It demonstrates great potential for PET image conversion in the case of limited tracer applications.
    VQA-GEN: A Visual Question Answering Benchmark for Domain Generalization. (arXiv:2311.00807v1 [cs.CV])
    Visual question answering (VQA) models are designed to demonstrate visual-textual reasoning capabilities. However, their real-world applicability is hindered by a lack of comprehensive benchmark datasets. Existing domain generalization datasets for VQA exhibit a unilateral focus on textual shifts, while VQA, being a multi-modal task, contains shifts across both visual and textual domains. We propose VQA-GEN, the first-ever multi-modal benchmark dataset for distribution shift generated through a shift-induced pipeline. Experiments demonstrate that the VQA-GEN dataset exposes the vulnerability of existing methods to joint multi-modal distribution shifts, validating that comprehensive multi-modal shifts are critical for robust VQA generalization. Models trained on VQA-GEN exhibit improved cross-domain and in-domain performance, confirming the value of VQA-GEN. Further, we analyze the importance of each shift technique of our pipeline contributing to the generalization of the model.
    MIST: Defending Against Membership Inference Attacks Through Membership-Invariant Subspace Training. (arXiv:2311.00919v1 [cs.CR])
    In Membership Inference (MI) attacks, the adversary tries to determine whether an instance was used to train a machine learning (ML) model. MI attacks are a major privacy concern when using private data to train ML models. Most MI attacks in the literature take advantage of the fact that ML models are trained to fit the training data well, and thus have very low loss on training instances. Most defenses against MI attacks therefore try to make the model fit the training data less well. Doing so, however, generally results in lower accuracy. We observe that training instances have different degrees of vulnerability to MI attacks. Most instances will have low loss even when not included in training. For these instances, the model can fit them well without concerns of MI attacks. An effective defense only needs to (possibly implicitly) identify instances that are vulnerable to MI attacks and avoid overfitting them. A major challenge is how to achieve such an effect in an efficient training process. Leveraging two distinct recent advancements in representation learning, counterfactually-invariant representations and subspace learning methods, we introduce a novel Membership-Invariant Subspace Training (MIST) method to defend against MI attacks. MIST avoids overfitting the vulnerable instances without significant impact on other instances. We have conducted extensive experimental studies, comparing MIST with various other state-of-the-art (SOTA) MI defenses against several SOTA MI attacks. We find that MIST outperforms other defenses while resulting in minimal reduction in testing accuracy.
    Federated Linear Bandits with Finite Adversarial Actions. (arXiv:2311.00973v1 [cs.LG])
    We study a federated linear bandits model, where $M$ clients communicate with a central server to solve a linear contextual bandits problem with finite adversarial action sets that may be different across clients. To address the unique challenges of adversarial finite action sets, we propose the FedSupLinUCB algorithm, which extends the principles of SupLinUCB and OFUL algorithms in linear contextual bandits. We prove that FedSupLinUCB achieves a total regret of $\tilde{O}(\sqrt{d T})$, where $T$ is the total number of arm pulls from all clients, and $d$ is the ambient dimension of the linear model. This matches the minimax lower bound and thus is order-optimal (up to polylog terms). We study both asynchronous and synchronous cases and show that the communication cost can be controlled as $O(d M^2 \log(d)\log(T))$ and $O(\sqrt{d^3 M^3} \log(d))$, respectively. The FedSupLinUCB design is further extended to two scenarios: (1) variance-adaptive, where a total regret of $\tilde{O} (\sqrt{d \sum \nolimits_{t=1}^{T} \sigma_t^2})$ can be achieved with $\sigma_t^2$ being the noise variance of round $t$; and (2) adversarial corruption, where a total regret of $\tilde{O}(\sqrt{dT} + d C_p)$ can be achieved with $C_p$ being the total corruption budget. Experiment results corroborate the theoretical analysis and demonstrate the effectiveness of FedSupLinUCB on both synthetic and real-world datasets.
    Autonomous Learning of Generative Models with Chemical Reaction Network Ensembles. (arXiv:2311.00975v1 [q-bio.MN])
    Can a micron-sized sack of interacting molecules autonomously learn an internal model of a complex and fluctuating environment? We draw insights from control theory, machine learning theory, chemical reaction network theory, and statistical physics to develop a general architecture whereby a broad class of chemical systems can autonomously learn complex distributions. Our construction takes the form of a chemical implementation of machine learning's optimization workhorse: gradient descent on the relative entropy cost function. We show how this method can be applied to optimize any detailed balanced chemical reaction network and that the construction is capable of using hidden units to learn complex distributions. This result is then recast as a form of integral feedback control. Finally, due to our use of an explicit physical model of learning, we are able to derive thermodynamic costs and trade-offs associated with this process.
    Better with Less: A Data-Active Perspective on Pre-Training Graph Neural Networks. (arXiv:2311.01038v1 [cs.LG])
    Pre-training on graph neural networks (GNNs) aims to learn transferable knowledge for downstream tasks with unlabeled data, and it has recently become an active research area. The success of graph pre-training models is often attributed to the massive amount of input data. In this paper, however, we identify the curse of big data phenomenon in graph pre-training: more training data do not necessarily lead to better downstream performance. Motivated by this observation, we propose a better-with-less framework for graph pre-training: fewer, but carefully chosen data are fed into a GNN model to enhance pre-training. The proposed pre-training pipeline is called the data-active graph pre-training (APT) framework, and is composed of a graph selector and a pre-training model. The graph selector chooses the most representative and instructive data points based on the inherent properties of graphs as well as predictive uncertainty. The proposed predictive uncertainty, as feedback from the pre-training model, measures the confidence level of the model in the data. When fed with the chosen data, on the other hand, the pre-training model grasps an initial understanding of the new, unseen data, and at the same time attempts to remember the knowledge learned from previous data. Therefore, the integration and interaction between these two components form a unified framework (APT), in which graph pre-training is performed in a progressive and iterative way. Experiment results show that the proposed APT is able to obtain an efficient pre-training model with fewer training data and better downstream performance.
    A Coreset-based, Tempered Variational Posterior for Accurate and Scalable Stochastic Gaussian Process Inference. (arXiv:2311.01409v1 [cs.LG])
    We present a novel stochastic variational Gaussian process ($\mathcal{GP}$) inference method, based on a posterior over a learnable set of weighted pseudo input-output points (coresets). Instead of a free-form variational family, the proposed coreset-based, variational tempered family for $\mathcal{GP}$s (CVTGP) is defined in terms of the $\mathcal{GP}$ prior and the data-likelihood; hence, accommodating the modeling inductive biases. We derive CVTGP's lower bound for the log-marginal likelihood via marginalization of the proposed posterior over latent $\mathcal{GP}$ coreset variables, and show it is amenable to stochastic optimization. CVTGP reduces the learnable parameter size to $\mathcal{O}(M)$, enjoys numerical stability, and maintains $\mathcal{O}(M^3)$ time- and $\mathcal{O}(M^2)$ space-complexity, by leveraging a coreset-based tempered posterior that, in turn, provides sparse and explainable representations of the data. Results on simulated and real-world regression problems with Gaussian observation noise validate that CVTGP provides better evidence lower-bound estimates and predictive root mean squared error than alternative stochastic $\mathcal{GP}$ inference methods.
    Getting aligned on representational alignment. (arXiv:2310.13018v2 [q-bio.NC] UPDATED)
    Biological and artificial information processing systems form representations that they can use to categorize, reason, plan, navigate, and make decisions. How can we measure the extent to which the representations formed by these diverse systems agree? Do similarities in representations then translate into similar behavior? How can a system's representations be modified to better match those of another system? These questions pertaining to the study of representational alignment are at the heart of some of the most active research areas in cognitive science, neuroscience, and machine learning. For example, cognitive scientists measure the representational alignment of multiple individuals to identify shared cognitive priors, neuroscientists align fMRI responses from multiple individuals into a shared representational space for group-level analyses, and ML researchers distill knowledge from teacher models into student models by increasing their alignment. Unfortunately, there is limited knowledge transfer between research communities interested in representational alignment, so progress in one field often ends up being rediscovered independently in another. Thus, greater cross-field communication would be advantageous. To improve communication between these fields, we propose a unifying framework that can serve as a common language between researchers studying representational alignment. We survey the literature from all three fields and demonstrate how prior work fits into this framework. Finally, we lay out open problems in representational alignment where progress can benefit all three of these fields. We hope that our work can catalyze cross-disciplinary collaboration and accelerate progress for all communities studying and developing information processing systems. We note that this is a working paper and encourage readers to reach out with their suggestions for future revisions.
    Understanding and Improving Ensemble Adversarial Defense. (arXiv:2310.18477v2 [cs.LG] UPDATED)
    The ensemble strategy has become popular in adversarial defense: multiple base classifiers are trained to defend against adversarial attacks in a cooperative manner. Despite the empirical success, theoretical explanations of why an ensemble of adversarially trained classifiers is more robust than single ones remain unclear. To fill this gap, we develop a new error theory dedicated to understanding ensemble adversarial defense, demonstrating a provable 0-1 loss reduction on challenging sample sets in an adversarial defense scenario. Guided by this theory, we propose an effective approach to improve ensemble adversarial defense, named interactive global adversarial training (iGAT). The proposal includes (1) a probabilistic distributing rule that selectively allocates to different base classifiers adversarial examples that are globally challenging to the ensemble, and (2) a regularization term to rescue the severest weaknesses of the base classifiers. Tested on various existing ensemble adversarial defense techniques, iGAT is capable of boosting their performance by up to 17%, evaluated using the CIFAR10 and CIFAR100 datasets under both white-box and black-box attacks.
    Bridging the Gap: Addressing Discrepancies in Diffusion Model Training for Classifier-Free Guidance. (arXiv:2311.00938v1 [cs.LG])
    Diffusion models have emerged as a pivotal advancement in generative models, setting new standards for the quality of the generated instances. In the current paper we aim to underscore a discrepancy between conventional training methods and the desired conditional sampling behavior of these models. While the prevalent classifier-free guidance technique works well, it's not without flaws. At higher values for the guidance scale parameter $w$, we often get out-of-distribution samples and mode collapse, whereas at lower values for $w$ we may not get the desired specificity. To address these challenges, we introduce an updated loss function that better aligns training objectives with sampling behaviors. Experimental validation with FID scores on CIFAR-10 elucidates our method's ability to produce higher quality samples with fewer sampling timesteps, and be more robust to the choice of guidance scale $w$. We also experiment with fine-tuning Stable Diffusion on the proposed loss, to provide early evidence that large diffusion models may also benefit from this refined loss function.
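    For readers unfamiliar with the guidance scale $w$ mentioned above, the following is a minimal sketch of how classifier-free guidance is commonly applied at sampling time. It assumes the widely used convention in which the guided noise estimate is $(1 + w)\,\epsilon_{cond} - w\,\epsilon_{uncond}$; the abstract does not state which convention the paper uses, and the function name `guided_noise` is hypothetical. It illustrates standard guidance only, not the updated loss the paper proposes.

```python
import numpy as np


def guided_noise(eps_cond, eps_uncond, w):
    """Combine conditional and unconditional noise predictions.

    Assumed convention: w = 0 returns the purely conditional prediction,
    and larger w extrapolates further away from the unconditional one.
    """
    eps_cond = np.asarray(eps_cond, dtype=float)
    eps_uncond = np.asarray(eps_uncond, dtype=float)
    return (1.0 + w) * eps_cond - w * eps_uncond


# Toy usage with made-up two-dimensional "noise predictions".
eps_c = np.array([1.0, 2.0])
eps_u = np.array([0.0, 0.0])
low_guidance = guided_noise(eps_c, eps_u, w=0.0)   # equals eps_c
high_guidance = guided_noise(eps_c, eps_u, w=1.0)  # twice as far from eps_u
```

    Because the combination is a linear extrapolation, a large $w$ can push the estimate well outside the range of either prediction, which is one intuition for the out-of-distribution samples described in the abstract.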
    KP-RNN: A Deep Learning Pipeline for Human Motion Prediction and Synthesis of Performance Art. (arXiv:2210.04366v3 [cs.CV] UPDATED)
    Digitally synthesizing human motion is an inherently complex process, which can create obstacles in application areas such as virtual reality. We offer a new approach for predicting human motion, KP-RNN, a neural network which can integrate easily with existing image processing and generation pipelines. We utilize a new human motion dataset of performance art, Take The Lead, as well as the motion generation pipeline, the Everybody Dance Now system, to demonstrate the effectiveness of KP-RNN's motion predictions. We have found that our neural network can predict human dance movements effectively, which serves as a baseline result for future works using the Take The Lead dataset. Since KP-RNN can work alongside a system such as Everybody Dance Now, we argue that our approach could inspire new methods for rendering human avatar animation. This work also serves to benefit the visualization of performance art in digital platforms by utilizing accessible neural networks.
    Manifold-augmented Eikonal Equations: Geodesic Distances and Flows on Differentiable Manifolds. (arXiv:2310.06157v2 [cs.CG] UPDATED)
    Manifolds discovered by machine learning models provide a compact representation of the underlying data. Geodesics on these manifolds define locally length-minimising curves and provide a notion of distance, which are key for reduced-order modelling, statistical inference, and interpolation. In this work, we propose a model-based parameterisation for distance fields and geodesic flows on manifolds, exploiting solutions of a manifold-augmented Eikonal equation. We demonstrate how the geometry of the manifold impacts the distance field, and exploit the geodesic flow to obtain globally length-minimising curves directly. This work opens opportunities for statistics and reduced-order modelling on differentiable manifolds.
    Scalable Probabilistic Forecasting in Retail with Gradient Boosted Trees: A Practitioner's Approach. (arXiv:2311.00993v1 [cs.LG])
    The recent M5 competition has advanced the state-of-the-art in retail forecasting. However, we notice important differences between the competition challenge and the challenges we face in a large e-commerce company. The datasets in our scenario are larger (hundreds of thousands of time series), and e-commerce can afford to have a larger assortment than brick-and-mortar retailers, leading to more intermittent data. To scale to larger dataset sizes with feasible computational effort, we first investigate a two-layer hierarchy and propose a top-down approach: forecasting at an aggregated level with fewer series and less intermittency, and then disaggregating to obtain the decision-level forecasts. Probabilistic forecasts are generated under distributional assumptions. Second, direct training at the lower level with subsamples can also be an alternative way of scaling. Performance of modelling with subsets is evaluated with the main dataset. Apart from a proprietary dataset, the proposed scalable methods are evaluated using the Favorita dataset and the M5 dataset. We are able to show the differences in characteristics of the e-commerce and brick-and-mortar retail datasets. Notably, our top-down forecasting framework enters the top 50 of the original M5 competition, even with models trained at a higher level under a much simpler setting.
    Wasserstein Quantum Monte Carlo: A Novel Approach for Solving the Quantum Many-Body Schrödinger Equation. (arXiv:2307.07050v3 [physics.comp-ph] UPDATED)
    Solving the quantum many-body Schrödinger equation is a fundamental and challenging problem in the fields of quantum physics, quantum chemistry, and material sciences. One of the common computational approaches to this problem is Quantum Variational Monte Carlo (QVMC), in which ground-state solutions are obtained by minimizing the energy of the system within a restricted family of parameterized wave functions. Deep learning methods partially address the limitations of traditional QVMC by representing a rich family of wave functions in terms of neural networks. However, the optimization objective in QVMC remains notoriously hard to minimize and requires second-order optimization methods such as natural gradient. In this paper, we first reformulate energy functional minimization in the space of Born distributions corresponding to particle-permutation (anti-)symmetric wave functions, rather than the space of wave functions. We then interpret QVMC as the Fisher-Rao gradient flow in this distributional space, followed by a projection step onto the variational manifold. This perspective provides us with a principled framework to derive new QMC algorithms, by endowing the distributional space with better metrics, and following the projected gradient flow induced by those metrics. More specifically, we propose "Wasserstein Quantum Monte Carlo" (WQMC), which uses the gradient flow induced by the Wasserstein metric, rather than the Fisher-Rao metric, and corresponds to transporting the probability mass, rather than teleporting it. We demonstrate empirically that the dynamics of WQMC result in faster convergence to the ground state of molecular systems.
    Sharp Noisy Binary Search with Monotonic Probabilities. (arXiv:2311.00840v1 [cs.DS])
    We revisit the noisy binary search model of Karp and Kleinberg, in which we have $n$ coins with unknown probabilities $p_i$ that we can flip. The coins are sorted by increasing $p_i$, and we would like to find where the probability crosses (to within $\varepsilon$) a target value $\tau$. This generalizes the fixed-noise model of Burnashev and Zigangirov, in which $p_i = \frac{1}{2} \pm \varepsilon$, to a setting where coins near the target may be indistinguishable from it. Karp and Kleinberg showed that $\Theta(\frac{1}{\varepsilon^2} \log n)$ samples are necessary and sufficient for this task. We produce a practical algorithm by solving two theoretical challenges: high-probability behavior and sharp constants. We give an algorithm that succeeds with probability $1-\delta$ from \[ \frac{1}{C_{\tau, \varepsilon}} \cdot \left(\lg n + O(\log^{2/3} n \log^{1/3} \frac{1}{\delta} + \log \frac{1}{\delta})\right) \] samples, where $C_{\tau, \varepsilon}$ is the optimal such constant achievable. For $\delta > n^{-o(1)}$ this is within $1 + o(1)$ of optimal, and for $\delta \ll 1$ it is the first bound within constant factors of optimal.
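    To make the problem setup concrete, here is a minimal sketch of a naive baseline: ordinary binary search in which each comparison is replaced by a batch of repeated flips of the middle coin. This is not the algorithm from the paper and does not achieve its sample bound; the function name `naive_noisy_binary_search` and the fixed batch size `flips_per_step` are illustrative assumptions.

```python
import random


def naive_noisy_binary_search(flip, n, tau, flips_per_step):
    """Return the smallest index whose empirical heads rate is >= tau.

    flip(i) performs one toss of coin i and returns True for heads.
    Coins are assumed sorted by increasing heads probability.
    Returns n if no coin appears to reach tau.
    """
    lo, hi = 0, n  # invariant: the answer lies in [lo, hi]
    while lo < hi:
        mid = (lo + hi) // 2
        heads = sum(1 for _ in range(flips_per_step) if flip(mid))
        if heads / flips_per_step >= tau:
            hi = mid        # coin mid looks at or above the target
        else:
            lo = mid + 1    # coin mid looks below the target
    return lo


# Toy usage: six coins with made-up probabilities and a target of 0.5.
probs = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
rng = random.Random(0)
crossing = naive_noisy_binary_search(
    lambda i: rng.random() < probs[i], n=len(probs), tau=0.5, flips_per_step=500
)
```

    The baseline spends the same number of flips on every comparison, including easy ones far from the target, which is one reason sharper algorithms such as the one described above can use fewer samples.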
    LocoGAN -- Locally Convolutional GAN. (arXiv:2002.07897v2 [eess.IV] UPDATED)
    In this paper we construct a fully convolutional GAN model, LocoGAN, whose latent space is given by noise-like images of possibly different resolutions. The learning is local, i.e., we process not the whole noise-like image, but sub-images of a fixed size. As a consequence, LocoGAN can produce images of arbitrary dimensions, e.g., for the LSUN bedroom data set. Another advantage of our approach comes from the fact that we use position channels, which allows the generation of fully periodic (e.g., cylindrical panoramic images) or almost periodic "infinitely long" images (e.g., wallpapers).
    Deception Game: Closing the Safety-Learning Loop in Interactive Robot Autonomy. (arXiv:2309.01267v2 [cs.RO] UPDATED)
    An outstanding challenge for the widespread deployment of robotic systems like autonomous vehicles is ensuring safe interaction with humans without sacrificing performance. Existing safety methods often neglect the robot's ability to learn and adapt at runtime, leading to overly conservative behavior. This paper proposes a new closed-loop paradigm for synthesizing safe control policies that explicitly account for the robot's evolving uncertainty and its ability to quickly respond to future scenarios as they arise, by jointly considering the physical dynamics and the robot's learning algorithm. We leverage adversarial reinforcement learning for tractable safety analysis under high-dimensional learning dynamics and demonstrate our framework's ability to work with both Bayesian belief propagation and implicit learning through large pre-trained neural trajectory predictors.
    Inversion of Bayesian Networks. (arXiv:2212.10649v2 [cs.LG] UPDATED)
    Variational autoencoders and Helmholtz machines use a recognition network (encoder) to approximate the posterior distribution of a generative model (decoder). In this paper we study the necessary and sufficient properties of a recognition network so that it can model the true posterior distribution exactly. These results are derived in the general context of probabilistic graphical modelling / Bayesian networks, for which the network represents a set of conditional independence statements. We derive both global conditions, in terms of d-separation, and local conditions for the recognition network to have the desired qualities. It turns out that for the local conditions the property of perfectness (for every node, all parents are joined) plays an important role.
    Parting with Misconceptions about Learning-based Vehicle Motion Planning. (arXiv:2306.07962v2 [cs.RO] UPDATED)
    The release of nuPlan marks a new era in vehicle motion planning research, offering the first large-scale real-world dataset and evaluation schemes requiring both precise short-term planning and long-horizon ego-forecasting. Existing systems struggle to simultaneously meet both requirements. Indeed, we find that these tasks are fundamentally misaligned and should be addressed independently. We further assess the current state of closed-loop planning in the field, revealing the limitations of learning-based methods in complex real-world scenarios and the value of simple rule-based priors such as centerline selection through lane graph search algorithms. More surprisingly, for the open-loop sub-task, we observe that the best results are achieved when using only this centerline as scene context (i.e., ignoring all information regarding the map and other agents). Combining these insights, we propose an extremely simple and efficient planner which outperforms an extensive set of competitors, winning the nuPlan planning challenge 2023.
    Dyadic Reinforcement Learning. (arXiv:2308.07843v5 [cs.LG] UPDATED)
    Mobile health aims to enhance health outcomes by delivering interventions to individuals as they go about their daily life. The involvement of care partners and social support networks often proves crucial in helping individuals manage burdensome medical conditions. This presents opportunities in mobile health to design interventions that target the dyadic relationship -- the relationship between a target person and their care partner -- with the aim of enhancing social support. In this paper, we develop dyadic RL, an online reinforcement learning algorithm designed to personalize intervention delivery based on contextual factors and past responses of a target person and their care partner. Here, multiple sets of interventions impact the dyad across multiple time intervals. The developed dyadic RL is Bayesian and hierarchical. We formally introduce the problem setup, develop dyadic RL and establish a regret bound. We demonstrate dyadic RL's empirical performance through simulation studies on both toy scenarios and a realistic test bed constructed from data collected in a mobile health study.
    Normalizing flows as approximations of optimal transport maps via linear-control neural ODEs. (arXiv:2311.01404v1 [math.OC])
    The term "Normalizing Flows" is related to the task of constructing invertible transport maps between probability measures by means of deep neural networks. In this paper, we consider the problem of recovering the $W_2$-optimal transport map $T$ between absolutely continuous measures $\mu,\nu\in\mathcal{P}(\mathbb{R}^n)$ as the flow of a linear-control neural ODE. We first show that, under suitable assumptions on $\mu,\nu$ and on the controlled vector fields, the optimal transport map is contained in the $C^0_c$-closure of the flows generated by the system. Assuming that discrete approximations $\mu_N,\nu_N$ of the original measures $\mu,\nu$ are available, we use a discrete optimal coupling $\gamma_N$ to define an optimal control problem. With a $\Gamma$-convergence argument, we prove that its solutions correspond to flows that approximate the optimal transport map $T$. Finally, taking advantage of the Pontryagin Maximum Principle, we propose an iterative numerical scheme for the resolution of the optimal control problem, resulting in an algorithm for the practical computation of the approximated optimal transport map.
    Tipping Points of Evolving Epidemiological Networks: Machine Learning-Assisted, Data-Driven Effective Modeling. (arXiv:2311.00797v1 [cs.LG])
    We study the tipping point collective dynamics of an adaptive susceptible-infected-susceptible (SIS) epidemiological network in a data-driven, machine learning-assisted manner. We identify a parameter-dependent effective stochastic differential equation (eSDE) in terms of physically meaningful coarse mean-field variables through a deep-learning ResNet architecture inspired by numerical stochastic integrators. We construct an approximate effective bifurcation diagram based on the identified drift term of the eSDE and contrast it with the mean-field SIS model bifurcation diagram. We observe a subcritical Hopf bifurcation in the evolving network's effective SIS dynamics, which causes the tipping point behavior; this takes the form of large amplitude collective oscillations that spontaneously -- yet rarely -- arise from the neighborhood of a (noisy) stationary state. We study the statistics of these rare events both through repeated brute force simulations and by using established mathematical/computational tools exploiting the right-hand side of the identified SDE. We demonstrate that such a collective SDE can also be identified (and the rare event computations also performed) in terms of data-driven coarse observables, obtained here via manifold learning techniques, in particular Diffusion Maps. The workflow of our study is straightforwardly applicable to other complex dynamics problems exhibiting tipping point dynamics.
    Effective Human-AI Teams via Learned Natural Language Rules and Onboarding. (arXiv:2311.01007v1 [cs.LG])
    People are relying on AI agents to assist them with various tasks. The human must know when to rely on the agent, collaborate with the agent, or ignore its suggestions. In this work, we propose to learn rules grounded in data regions and described in natural language that illustrate how the human should collaborate with the AI. Our novel region discovery algorithm finds local regions in the data as neighborhoods in an embedding space that corrects the human prior. Each region is then described using an iterative and contrastive procedure where a large language model describes the region. We then teach these rules to the human via an onboarding stage. Through user studies on object detection and question-answering tasks, we show that our method can lead to more accurate human-AI teams. We also evaluate our region discovery and description algorithms separately.
    Learning to See Physical Properties with Active Sensing Motor Policies. (arXiv:2311.01405v1 [cs.RO])
    Knowledge of terrain's physical properties inferred from color images can aid in making efficient robotic locomotion plans. However, unlike image classification, it is unintuitive for humans to label image patches with physical properties. Without labeled data, building a vision system that takes as input the observed terrain and predicts physical properties remains challenging. We present a method that overcomes this challenge by self-supervised labeling of images captured by robots during real-world traversal with physical property estimators trained in simulation. To ensure accurate labeling, we introduce Active Sensing Motor Policies (ASMP), which are trained to explore locomotion behaviors that increase the accuracy of estimating physical parameters. For instance, the quadruped robot learns to swipe its foot against the ground to estimate the friction coefficient accurately. We show that the visual system trained with a small amount of real-world traversal data accurately predicts physical parameters. The trained system is robust and works even with overhead images captured by a drone despite being trained on data collected by cameras attached to a quadruped robot walking on the ground.
    Hierarchical Proxy Modeling for Improved HPO in Time Series Forecasting. (arXiv:2211.15092v2 [cs.LG] UPDATED)
    Selecting the right set of hyperparameters is crucial in time series forecasting. The classical temporal cross-validation framework for hyperparameter optimization (HPO) often leads to poor test performance because of a possible mismatch between validation and test periods. To address this test-validation mismatch, we propose a novel technique, H-Pro, to drive HPO via test proxies by exploiting data hierarchies often associated with time series datasets. Since higher-level aggregated time series often show less irregularity and better predictability than the lowest-level time series, which can be sparse and intermittent, we optimize the hyperparameters of the lowest-level base-forecaster by leveraging the proxy forecasts for the test period generated from the forecasters at higher levels. H-Pro can be applied to any off-the-shelf machine learning model to perform HPO. We validate the efficacy of our technique with extensive empirical evaluation on five publicly available hierarchical forecasting datasets. Our approach outperforms existing state-of-the-art methods on the Tourism, Wiki, and Traffic datasets, and achieves a competitive result on the Tourism-L dataset, without any model-specific enhancements. Moreover, our method outperforms the winning method of the M5 forecast accuracy competition.
    CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society. (arXiv:2303.17760v2 [cs.AI] UPDATED)
    The rapid advancement of chat-based language models has led to remarkable progress in complex task-solving. However, their success heavily relies on human input to guide the conversation, which can be challenging and time-consuming. This paper explores the potential of building scalable techniques to facilitate autonomous cooperation among communicative agents, and provides insight into their "cognitive" processes. To address the challenges of achieving autonomous cooperation, we propose a novel communicative agent framework named role-playing. Our approach involves using inception prompting to guide chat agents toward task completion while maintaining consistency with human intentions. We showcase how role-playing can be used to generate conversational data for studying the behaviors and capabilities of a society of agents, providing a valuable resource for investigating conversational language models. In particular, we conduct comprehensive studies on instruction-following cooperation in multi-agent settings. Our contributions include introducing a novel communicative agent framework, offering a scalable approach for studying the cooperative behaviors and capabilities of multi-agent systems, and open-sourcing our library to support research on communicative agents and beyond: https://github.com/camel-ai/camel.
    JADE: A Linguistics-based Safety Evaluation Platform for LLM. (arXiv:2311.00286v2 [cs.CL] UPDATED)
    In this paper, we present JADE, a targeted linguistic fuzzing platform which strengthens the linguistic complexity of seed questions to simultaneously and consistently break a wide range of widely-used LLMs categorized in three groups: eight open-sourced Chinese, six commercial Chinese and four commercial English LLMs. JADE generates three safety benchmarks for the three groups of LLMs, which contain unsafe questions that are highly threatening: the questions simultaneously trigger harmful generation of multiple LLMs, with an average unsafe generation ratio of $70\%$ (please see the table below), while remaining natural, fluent questions that preserve the core unsafe semantics. We release the benchmark demos generated for commercial English LLMs and open-sourced English LLMs in the following link: https://github.com/whitzard-ai/jade-db. Readers who are interested in evaluating on more questions generated by JADE are welcome to contact us. JADE is based on Noam Chomsky's seminal theory of transformational-generative grammar. Given a seed question with unsafe intention, JADE invokes a sequence of generative and transformational rules to increment the complexity of the syntactic structure of the original question, until the safety guardrail is broken. Our key insight is that, due to the complexity of human language, most of the current best LLMs can hardly recognize the invariant evil from the infinite number of different syntactic structures, which form an unbounded example space that can never be fully covered. Technically, the generative/transformative rules are constructed by native speakers of the languages, and, once developed, can be used to automatically grow and transform the parse tree of a given question, until the guardrail is broken. For more evaluation results and demos, please check our website: https://whitzard-ai.github.io/jade.html.
    Investigating Relative Performance of Transfer and Meta Learning. (arXiv:2311.00727v1 [cs.LG])
    Over the past decade, the field of machine learning has experienced remarkable advancements. While image recognition systems have achieved impressive levels of accuracy, they continue to rely on extensive training datasets. Additionally, a significant challenge has emerged in the form of poor out-of-distribution performance, which necessitates retraining neural networks when they encounter conditions that deviate from their training data. This limitation has notably contributed to the slow progress in self-driving car technology. These pressing issues have sparked considerable interest in methods that enable neural networks to learn effectively from limited data. This paper presents the outcomes of an extensive investigation designed to compare two distinct approaches, transfer learning and meta learning, as potential solutions to this problem. The overarching objective was to establish a robust criterion for selecting the most suitable method in diverse machine learning scenarios. Building upon prior research, I expanded the comparative analysis by introducing a new meta learning method into the investigation. Subsequently, I assessed whether the findings remained consistent under varying conditions. Finally, I delved into the impact of altering the size of the training dataset on the relative performance of these methods. This comprehensive exploration has yielded insights into the conditions favoring each approach, thereby facilitating the development of a criterion for selecting the most appropriate method in any given situation.
    Conformalized Deep Splines for Optimal and Efficient Prediction Sets. (arXiv:2311.00774v1 [cs.LG])
    Uncertainty estimation is critical in high-stakes machine learning applications. One effective way to estimate uncertainty is conformal prediction, which can provide predictive inference with statistical coverage guarantees. We present a new conformal regression method, Spline Prediction Intervals via Conformal Estimation (SPICE), that estimates the conditional density using neural-network-parameterized splines. We prove universal approximation and optimality results for SPICE, which are empirically validated by our experiments. SPICE is compatible with two different efficient-to-compute conformal scores, one oracle-optimal for marginal coverage (SPICE-ND) and the other asymptotically optimal for conditional coverage (SPICE-HPD). Results on benchmark datasets demonstrate SPICE-ND models achieve the smallest average prediction set sizes, including average size reductions of nearly 50% for some datasets compared to the next best baseline. SPICE-HPD models achieve the best conditional coverage compared to baselines. The SPICE implementation is made available.
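    To make the coverage guarantee behind conformal prediction concrete, the following is a minimal sketch of standard split conformal regression with an absolute-residual score. It is not the SPICE method from the abstract above (SPICE uses spline-based conditional densities and different scores); the function name and arguments are hypothetical and chosen for illustration only.

```python
import math


def split_conformal_interval(cal_preds, cal_targets, test_pred, alpha):
    """Return a prediction interval with marginal coverage >= 1 - alpha.

    cal_preds / cal_targets: model predictions and true values on a held-out
    calibration set (assumed exchangeable with the test point).
    """
    # Nonconformity score: absolute residual on each calibration point.
    scores = sorted(abs(y - p) for p, y in zip(cal_preds, cal_targets))
    n = len(scores)
    # Finite-sample corrected rank of the calibration quantile.
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: only the trivial interval is valid.
        return (-math.inf, math.inf)
    q = scores[k - 1]
    return (test_pred - q, test_pred + q)
```

    The interval width here is the same for every test point, which is exactly the kind of inefficiency that density-based scores aim to reduce.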
    Neural Field Dynamics Model for Granular Object Piles Manipulation. (arXiv:2311.00802v1 [cs.RO])
    We present a learning-based dynamics model for granular material manipulation. Inspired by the Eulerian approach commonly used in fluid dynamics, our method adopts a fully convolutional neural network that operates on a density field-based representation of object piles and pushers, allowing it to exploit the spatial locality of inter-object interactions as well as the translation equivariance through convolution operations. Furthermore, our differentiable action rendering module makes the model fully differentiable and can be directly integrated with a gradient-based trajectory optimization algorithm. We evaluate our model with a wide array of piles manipulation tasks both in simulation and real-world experiments and demonstrate that it significantly exceeds existing latent or particle-based methods in both accuracy and computation efficiency, and exhibits zero-shot generalization capabilities across various environments and tasks.
    Exploring Unified Perspective For Fast Shapley Value Estimation. (arXiv:2311.01010v1 [cs.LG])
    Shapley values have emerged as a widely accepted and trustworthy tool, grounded in theoretical axioms, for addressing challenges posed by black-box models like deep neural networks. However, computing Shapley values encounters exponential complexity in the number of features. Various approaches, including ApproSemivalue, KernelSHAP, and FastSHAP, have been explored to expedite the computation. We analyze the consistency of existing works and conclude that stochastic estimators can be unified as the linear transformation of importance sampling of feature subsets. Based on this, we investigate the possibility of designing simple amortized estimators and propose a straightforward and efficient one, SimSHAP, by eliminating redundant techniques. Extensive experiments conducted on tabular and image datasets validate the effectiveness of our SimSHAP, which significantly accelerates the computation of accurate Shapley values.
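    The exponential cost mentioned above is easy to see in a direct implementation of the Shapley formula, which enumerates every subset of the other features for each feature. The sketch below is a plain exact computation for small games, not SimSHAP or any of the estimators named in the abstract; `exact_shapley` and its `value` argument are hypothetical names.

```python
from itertools import combinations
from math import factorial


def exact_shapley(n_features, value):
    """Exact Shapley values; `value` maps a frozenset of feature indices to a number.

    Runs in O(n * 2^(n-1)) evaluations of `value`, so it is only usable for small n.
    """
    phi = [0.0] * n_features
    n_fact = factorial(n_features)
    for i in range(n_features):
        others = [j for j in range(n_features) if j != i]
        for size in range(len(others) + 1):
            # Weight of a coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(n_features - size - 1) / n_fact
            for subset in combinations(others, size):
                s = frozenset(subset)
                # Marginal contribution of feature i to coalition s.
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi
```

    For an additive game, where the value of a coalition is the sum of per-feature weights, each feature's Shapley value equals its own weight, which gives a quick sanity check.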
    Distance-Based Propagation for Efficient Knowledge Graph Reasoning. (arXiv:2311.01024v1 [cs.LG])
    Knowledge graph completion (KGC) aims to predict unseen edges in knowledge graphs (KGs), resulting in the discovery of new facts. A new class of methods has been proposed to tackle this problem by aggregating path information. These methods have shown tremendous ability in the task of KGC. However, they are plagued by efficiency issues. Though there are a few recent attempts to address this through learnable path pruning, they often sacrifice the performance to gain efficiency. In this work, we identify two intrinsic limitations of these methods that affect the efficiency and representation quality. To address the limitations, we introduce a new method, TAGNet, which is able to efficiently propagate information. This is achieved by only aggregating paths in a fixed window for each source-target pair. We demonstrate that the complexity of TAGNet is independent of the number of layers. Extensive experiments demonstrate that TAGNet can cut down on the number of propagated messages by as much as 90% while achieving competitive performance on multiple KG datasets. The code is available at https://github.com/HarryShomer/TAGNet.
    Accelerating Electronic Stopping Power Predictions by 10 Million Times with a Combination of Time-Dependent Density Functional Theory and Machine Learning. (arXiv:2311.00787v1 [cond-mat.mtrl-sci])
    Knowing the rate at which particle radiation releases energy in a material, the stopping power, is key to designing nuclear reactors, medical treatments, semiconductor and quantum materials, and many other technologies. While the nuclear contribution to stopping power, i.e., elastic scattering between atoms, is well understood in the literature, the route for gathering data on the electronic contribution has for decades remained costly and reliant on many simplifying assumptions, including that materials are isotropic. We establish a method that combines time-dependent density functional theory (TDDFT) and machine learning to reduce the time to assess new materials to mere hours on a supercomputer and provides valuable data on how atomic details influence electronic stopping. Our approach uses TDDFT to compute the electronic stopping contributions to stopping power from first principles in several directions and then machine learning to interpolate to other directions at rates 10 million times higher. We demonstrate the combined approach in a study of proton irradiation in aluminum and employ it to predict how the depth of maximum energy deposition, the "Bragg Peak," varies depending on incident angle -- a quantity otherwise inaccessible to modelers. The lack of any experimental information requirement makes our method applicable to most materials, and its speed makes it a prime candidate for enabling quantum-to-continuum models of radiation damage. The prospect of reusing valuable TDDFT data for training the model makes our approach appealing for applications in the age of materials data science.
    Sorting with Predictions. (arXiv:2311.00749v1 [cs.DS])
    We explore the fundamental problem of sorting through the lens of learning-augmented algorithms, where algorithms can leverage possibly erroneous predictions to improve their efficiency. We consider two different settings: In the first setting, each item is provided a prediction of its position in the sorted list. In the second setting, we assume there is a "quick-and-dirty" way of comparing items, in addition to slow-and-exact comparisons. For both settings, we design new and simple algorithms using only $O(\sum_i \log \eta_i)$ exact comparisons, where $\eta_i$ is a suitably defined prediction error for the $i$th element. In particular, as the quality of predictions deteriorates, the number of comparisons degrades smoothly from $O(n)$ to $O(n\log n)$. We prove that the comparison complexity is theoretically optimal with respect to the examined error measures. An experimental evaluation against existing adaptive and non-adaptive sorting algorithms demonstrates the potential of applying learning-augmented algorithms in sorting tasks.
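    The first setting above (each item comes with a predicted position) can be illustrated with a small sketch. This is not the algorithm from the paper: it is a simple illustration, under our own assumptions, of how good predictions can reduce the number of exact comparisons. Items are first ordered by predicted position, which uses no item comparisons, and each item is then placed by a galloping search from the end of the output, so the comparison cost per item grows with the logarithm of how far it is displaced. The name `sort_with_predictions` is hypothetical, and `list.insert` still moves elements in linear time, so only the comparison count is meant to be illustrative.

```python
from bisect import bisect_right


def sort_with_predictions(items, predicted_pos):
    """Sort `items`, using predicted positions to keep exact comparisons low."""
    # Order by prediction only; no comparisons between items happen here.
    order = sorted(range(len(items)), key=lambda i: predicted_pos[i])
    out = []
    for i in order:
        x = items[i]
        # Gallop backwards from the end: everything at index >= hi is > x.
        hi = len(out)
        step = 1
        while hi - step >= 0 and out[hi - step] > x:
            hi -= step
            step *= 2
        lo = max(0, hi - step)
        # Binary search inside the bracket [lo, hi] found by galloping.
        pos = bisect_right(out, x, lo, hi)
        out.insert(pos, x)
    return out
```

    With perfect predictions every item is appended after a single comparison; with poor predictions the search falls back to roughly logarithmic cost per item, mirroring the smooth degradation described in the abstract.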
    Generalizing Importance Weighting to A Universal Solver for Distribution Shift Problems. (arXiv:2305.14690v2 [cs.LG] UPDATED)
    Distribution shift (DS) may have two levels: the distribution itself changes, and the support (i.e., the set where the probability density is non-zero) also changes. When considering the support change between the training and test distributions, there can be four cases: (i) they exactly match; (ii) the training support is wider (and thus covers the test support); (iii) the test support is wider; (iv) they partially overlap. Existing methods are good at cases (i) and (ii), while cases (iii) and (iv) are more common nowadays but still under-explored. In this paper, we generalize importance weighting (IW), a golden solver for cases (i) and (ii), to a universal solver for all cases. Specifically, we first investigate why IW might fail in cases (iii) and (iv); based on the findings, we propose generalized IW (GIW) that could handle cases (iii) and (iv) and would reduce to IW in cases (i) and (ii). In GIW, the test support is split into an in-training (IT) part and an out-of-training (OOT) part, and the expected risk is decomposed into a weighted classification term over the IT part and a standard classification term over the OOT part, which guarantees the risk consistency of GIW. Then, the implementation of GIW consists of three components: (a) the split of validation data is carried out by the one-class support vector machine, (b) the first term of the empirical risk can be handled by any IW algorithm given training data and IT validation data, and (c) the second term just involves OOT validation data. Experiments demonstrate that GIW is a universal solver for DS problems, outperforming IW methods in cases (iii) and (iv).
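    The risk decomposition described above can be written out as a short derivation. The notation below is our own and may differ from the paper's: $z=(x,y)$, $p_{\mathrm{tr}}$ and $p_{\mathrm{te}}$ are the training and test densities with supports $\mathcal{S}_{\mathrm{tr}}$ and $\mathcal{S}_{\mathrm{te}}$, and $\alpha$ is the test probability of the in-training part.

```latex
\begin{align*}
\mathcal{S}_{\mathrm{IT}} &= \mathcal{S}_{\mathrm{te}} \cap \mathcal{S}_{\mathrm{tr}}, \qquad
\mathcal{S}_{\mathrm{OOT}} = \mathcal{S}_{\mathrm{te}} \setminus \mathcal{S}_{\mathrm{tr}}, \qquad
\alpha = \Pr_{z \sim p_{\mathrm{te}}}\!\left[z \in \mathcal{S}_{\mathrm{IT}}\right], \\
R(f) &= \mathbb{E}_{p_{\mathrm{te}}}\!\left[\ell(f; z)\right]
      = \int_{\mathcal{S}_{\mathrm{IT}}} \ell(f; z)\, p_{\mathrm{te}}(z)\, dz
      + \int_{\mathcal{S}_{\mathrm{OOT}}} \ell(f; z)\, p_{\mathrm{te}}(z)\, dz \\
     &= \mathbb{E}_{p_{\mathrm{tr}}}\!\left[w(z)\, \ell(f; z)\right]
      + (1 - \alpha)\, \mathbb{E}_{p_{\mathrm{te}}}\!\left[\ell(f; z) \mid z \in \mathcal{S}_{\mathrm{OOT}}\right],
\qquad
w(z) = \begin{cases} p_{\mathrm{te}}(z) / p_{\mathrm{tr}}(z), & z \in \mathcal{S}_{\mathrm{IT}}, \\ 0, & \text{otherwise}. \end{cases}
\end{align*}
```

    When the training support covers the test support, as in cases (i) and (ii), $\mathcal{S}_{\mathrm{OOT}}$ is empty, $\alpha = 1$, and the expression reduces to ordinary importance weighting.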
    Data-Driven Model Selections of Second-Order Particle Dynamics via Integrating Gaussian Processes with Low-Dimensional Interacting Structures. (arXiv:2311.00902v1 [stat.ML])
    In this paper, we focus on the data-driven discovery of a general second-order particle-based model that contains many state-of-the-art models for modeling the aggregation and collective behavior of interacting agents of similar size and body type. This model takes the form of a high-dimensional system of ordinary differential equations parameterized by two interaction kernels that appraise the alignment of positions and velocities. We propose a Gaussian Process-based approach to this problem, where the unknown model parameters are marginalized by using two independent Gaussian Process (GP) priors on latent interaction kernels constrained to dynamics and observational data. This results in a nonparametric model for interacting dynamical systems that accounts for uncertainty quantification. We also develop acceleration techniques to improve scalability. Moreover, we perform a theoretical analysis to interpret the methodology and investigate the conditions under which the kernels can be recovered. We demonstrate the effectiveness of the proposed approach on various prototype systems, including the selection of the order of the systems and the types of interactions. In particular, we present applications to modeling two real-world fish motion datasets that display flocking and milling patterns up to 248 dimensions. Despite the use of small data sets, the GP-based approach learns an effective representation of the nonlinear dynamics in these spaces and outperforms competitor methods.
    Contrastive Modules with Temporal Attention for Multi-Task Reinforcement Learning. (arXiv:2311.01075v1 [cs.LG])
    In the field of multi-task reinforcement learning, the modular principle, which involves specializing functionalities into different modules and combining them appropriately, has been widely adopted as a promising approach to prevent the negative transfer problem, in which performance degrades due to conflicts between tasks. However, most of the existing multi-task RL methods only combine shared modules at the task level, ignoring that there may be conflicts within the task. In addition, these methods do not take into account that without constraints, some modules may learn similar functions, which restricts the expressiveness and generalization capability of modular methods. In this paper, we propose the Contrastive Modules with Temporal Attention (CMTA) method to address these limitations. CMTA constrains the modules to be different from each other by contrastive learning and combines shared modules at a finer granularity than the task level with temporal attention, alleviating the negative transfer within the task and improving the generalization ability and the performance for multi-task RL. We conducted the experiment on Meta-World, a multi-task RL benchmark containing various robotics manipulation tasks. Experimental results show that CMTA outperforms learning each task individually for the first time and achieves substantial performance improvements over the baselines.
    Contrastive Moments: Unsupervised Halfspace Learning in Polynomial Time. (arXiv:2311.01435v1 [cs.LG])
    We give a polynomial-time algorithm for learning high-dimensional halfspaces with margins in $d$-dimensional space to within desired TV distance when the ambient distribution is an unknown affine transformation of the $d$-fold product of an (unknown) symmetric one-dimensional logconcave distribution, and the halfspace is introduced by deleting at least an $\epsilon$ fraction of the data in one of the component distributions. Notably, our algorithm does not need labels and establishes the unique (and efficient) identifiability of the hidden halfspace under this distributional assumption. The sample and time complexity of the algorithm are polynomial in the dimension and $1/\epsilon$. The algorithm uses only the first two moments of suitable re-weightings of the empirical distribution, which we call contrastive moments; its analysis uses classical facts about generalized Dirichlet polynomials and relies crucially on a new monotonicity property of the moment ratio of truncations of logconcave distributions. Such algorithms, based only on first and second moments, were suggested in earlier work, but hitherto eluded rigorous guarantees. Prior work addressed the special case when the underlying distribution is Gaussian via Non-Gaussian Component Analysis. We improve on this by providing polytime guarantees based on Total Variation (TV) distance, in place of existing moment-bound guarantees that can be super-polynomial. Our work is also the first to go beyond Gaussians in this setting.
    Unraveling Fundamental Properties of Power System Resilience Curves using Unsupervised Machine Learning. (arXiv:2310.10030v2 [cs.LG] UPDATED)
    The standard model of infrastructure resilience, the resilience triangle, has been the primary way of characterizing and quantifying infrastructure resilience. However, the theoretical model merely provides a one-size-fits-all framework for all infrastructure systems. Most of the existing studies examine the characteristics of infrastructure resilience curves based on analytical models constructed upon simulated system performance. The limited number of empirical studies has hindered our ability to fully understand and predict resilience characteristics in infrastructure systems. To address this gap, this study examined over 200 resilience curves related to power outages in three major extreme weather events. Using unsupervised machine learning, we examined different curve archetypes, as well as the fundamental properties of each resilience curve archetype. The results show two primary archetypes for power system resilience curves: triangular and trapezoidal curves. Triangular curves characterize resilience behavior based on 1. critical functionality threshold, 2. critical functionality recovery rate, and 3. recovery pivot point. Trapezoidal archetypes explain resilience curves based on 1. duration of sustained function loss and 2. constant recovery rate. The longer the duration of sustained function loss, the slower the constant rate of recovery. The findings of this study provide novel perspectives enabling better understanding and prediction of resilience performance of power system infrastructures.
    Push it to the Demonstrated Limit: Multimodal Visuotactile Imitation Learning with Force Matching. (arXiv:2311.01248v1 [cs.RO])
    Optical tactile sensors have emerged as an effective means to acquire dense contact information during robotic manipulation. A recently-introduced `see-through-your-skin' (STS) variant of this type of sensor has both visual and tactile modes, enabled by leveraging a semi-transparent surface and controllable lighting. In this work, we investigate the benefits of pairing visuotactile sensing with imitation learning for contact-rich manipulation tasks. First, we use tactile force measurements and a novel algorithm during kinesthetic teaching to yield a force profile that better matches that of the human demonstrator. Second, we add visual/tactile STS mode switching as a control policy output, simplifying the application of the sensor. Finally, we study multiple observation configurations to compare and contrast the value of visual/tactile data (both with and without mode switching) with visual data from a wrist-mounted eye-in-hand camera. We perform an extensive series of experiments on a real robotic manipulator with door-opening and closing tasks, including over 3,000 real test episodes. Our results highlight the importance of tactile sensing for imitation learning, both for data collection to allow force matching, and for policy execution to allow accurate task feedback.
    Calibrated Explanations: with Uncertainty Information and Counterfactuals. (arXiv:2305.02305v2 [cs.AI] UPDATED)
    While local explanations for AI models can offer insights into individual predictions, such as feature importance, they are plagued by issues like instability. The unreliability of feature weights, often skewed due to poorly calibrated ML models, deepens these challenges. Moreover, the critical aspect of feature importance uncertainty remains mostly unaddressed in Explainable AI (XAI). The novel feature importance explanation method presented in this paper, called Calibrated Explanations (CE), is designed to tackle these issues head-on. Built on the foundation of Venn-Abers, CE not only calibrates the underlying model but also delivers reliable feature importance explanations with an exact definition of the feature weights. CE goes beyond conventional solutions by addressing output uncertainty. It accomplishes this by providing uncertainty quantification for both feature weights and the model's probability estimates. Additionally, CE is model-agnostic, featuring easily comprehensible conditional rules and the ability to generate counterfactual explanations with embedded uncertainty quantification. Results from an evaluation with 25 benchmark datasets underscore the efficacy of CE, making it stand as a fast, reliable, stable, and robust solution.
    Enhancing Clustering Representations with Positive Proximity and Cluster Dispersion Learning. (arXiv:2311.00731v1 [cs.LG])
    Contemporary deep clustering approaches often rely on either contrastive or non-contrastive techniques to acquire effective representations for clustering tasks. Contrastive methods leverage negative pairs to achieve homogenous representations but can introduce class collision issues, potentially compromising clustering performance. On the contrary, non-contrastive techniques prevent class collisions but may produce non-uniform representations that lead to clustering collapse. In this work, we propose a novel end-to-end deep clustering approach named PIPCDR, designed to harness the strengths of both approaches while mitigating their limitations. PIPCDR incorporates a positive instance proximity loss and a cluster dispersion regularizer. The positive instance proximity loss ensures alignment between augmented views of instances and their sampled neighbors, enhancing within-cluster compactness by selecting genuinely positive pairs within the embedding space. Meanwhile, the cluster dispersion regularizer maximizes inter-cluster distances while minimizing within-cluster compactness, promoting uniformity in the learned representations. PIPCDR excels in producing well-separated clusters, generating uniform representations, avoiding class collision issues, and enhancing within-cluster compactness. We extensively validate the effectiveness of PIPCDR within an end-to-end Majorize-Minimization framework, demonstrating its competitive performance on moderate-scale clustering benchmark datasets and establishing new state-of-the-art results on large-scale datasets.
    Analysis of Information Propagation in Ethereum Network Using Combined Graph Attention Network and Reinforcement Learning to Optimize Network Efficiency and Scalability. (arXiv:2311.01406v1 [cs.LG])
    Blockchain technology has revolutionized the way information is propagated in decentralized networks. Ethereum plays a pivotal role in facilitating smart contracts and decentralized applications. Understanding information propagation dynamics in Ethereum is crucial for ensuring network efficiency, security, and scalability. In this study, we propose an innovative approach that utilizes Graph Convolutional Networks (GCNs) to analyze the information propagation patterns in the Ethereum network. The first phase of our research involves data collection from the Ethereum blockchain, consisting of blocks, transactions, and node degrees. We construct a transaction graph representation using adjacency matrices to capture the node embeddings, while our major contribution is to develop a combined Graph Attention Network (GAT) and Reinforcement Learning (RL) model to optimize the network efficiency and scalability. It learns the best actions to take in various network states, ultimately leading to improved network efficiency and throughput and optimized gas limits for block processing. In the experimental evaluation, we analyze the performance of our model on a large-scale Ethereum dataset. We investigate how to effectively aggregate information from neighboring nodes, capture graph structure, and update node embeddings using GCN with the objective of transaction pattern prediction, accounting for varying network loads and number of blocks. We not only design a gas limit optimization model and provide the algorithm but also, to address scalability, demonstrate the use and implementation of sparse matrices in GraphConv, GraphSAGE, and GAT. The results indicate that our designed GAT-RL model achieves superior results compared to other GCN models in terms of performance. It effectively propagates information across the network, optimizing gas limits for block processing and improving network efficiency.
    Meaning Representations from Trajectories in Autoregressive Models. (arXiv:2310.18348v2 [cs.CL] UPDATED)
    We propose to extract meaning representations from autoregressive language models by considering the distribution of all possible trajectories extending an input text. This strategy is prompt-free, does not require fine-tuning, and is applicable to any pre-trained autoregressive model. Moreover, unlike vector-based representations, distribution-based representations can also model asymmetric relations (e.g., direction of logical entailment, hypernym/hyponym relations) by using algebraic operations between likelihood functions. These ideas are grounded in distributional perspectives on semantics and are connected to standard constructions in automata theory, but to our knowledge they have not been applied to modern language models. We empirically show that the representations obtained from large models align well with human annotations, outperform other zero-shot and prompt-free methods on semantic similarity tasks, and can be used to solve more complex entailment and containment tasks that standard embeddings cannot handle. Finally, we extend our method to represent data from different modalities (e.g., image and text) using multimodal autoregressive models.
    AI for Interpretable Chemistry: Predicting Radical Mechanistic Pathways via Contrastive Learning. (arXiv:2311.01118v1 [cs.LG])
    Deep learning-based reaction predictors have undergone significant architectural evolution. However, their reliance on reactions from the US Patent Office results in a lack of interpretable predictions and limited generalization capability to other chemistry domains, such as radical and atmospheric chemistry. To address these challenges, we introduce a new reaction predictor system, RMechRP, that leverages contrastive learning in conjunction with mechanistic pathways, the most interpretable representation of chemical reactions. Specifically designed for radical reactions, RMechRP provides different levels of interpretation of chemical reactions. We develop and train multiple deep-learning models using RMechDB, a public database of radical reactions, to establish the first benchmark for predicting radical reactions. Our results demonstrate the effectiveness of RMechRP in providing accurate and interpretable predictions of radical reactions, and its potential for various applications in atmospheric chemistry.
    FlashDecoding++: Faster Large Language Model Inference on GPUs. (arXiv:2311.01282v1 [cs.LG])
    Large Language Models (LLMs) are becoming increasingly important in various domains. However, the following challenges remain unsolved in accelerating LLM inference: (1) Synchronized partial softmax update. The softmax operation requires a synchronized update operation among each partial softmax result, leading to ~20% overhead for the attention computation in LLMs. (2) Under-utilized computation of flat GEMM. The shape of matrices performing GEMM in LLM inference is flat, leading to under-utilized computation and >50% performance loss after padding zeros in previous designs. (3) Performance loss due to static dataflow. Kernel performance in LLM depends on varied input data features, hardware configurations, etc. A single and static dataflow may lead to a 50.25% performance loss for GEMMs of different shapes in LLM inference. We present FlashDecoding++, a fast LLM inference engine supporting mainstream LLMs and hardware back-ends. To tackle the above challenges, FlashDecoding++ creatively proposes: (1) Asynchronized softmax with unified max value. FlashDecoding++ introduces a unified max value technique for different partial softmax computations to avoid synchronization. (2) Flat GEMM optimization with double buffering. FlashDecoding++ points out that flat GEMMs with different shapes face varied bottlenecks. Then, techniques like double buffering are introduced. (3) Heuristic dataflow with hardware resource adaptation. FlashDecoding++ heuristically optimizes dataflow using different hardware resources considering input dynamics. Due to the versatility of optimizations in FlashDecoding++, FlashDecoding++ can achieve up to 4.86x and 2.18x speedup on both NVIDIA and AMD GPUs compared to Hugging Face implementations. FlashDecoding++ also achieves an average speedup of 1.37x compared to state-of-the-art LLM inference engines on mainstream LLMs.
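    The idea behind a unified max value can be illustrated in a few lines. Softmax is unchanged when the same constant is subtracted from every input, so if all chunks subtract one shared constant instead of their own running maximum, partial results can be computed independently and simply added. The sketch below is a CPU illustration of that identity only, not the FlashDecoding++ kernel; `phi`, `chunk_size`, and the function names are our own. A fixed constant is numerically safe only while the shifted inputs stay in a moderate range, and the abstract does not say how the engine handles inputs outside it.

```python
import math


def softmax_reference(xs):
    """Standard softmax that subtracts the global maximum for stability."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_unified_max(xs, phi, chunk_size):
    """Chunked softmax where every chunk subtracts the same constant `phi`.

    Because no chunk depends on another chunk's maximum, the partial sums
    need no rescaling step before they are combined.
    """
    partial_exps = []
    partial_sums = []
    for start in range(0, len(xs), chunk_size):
        chunk = xs[start:start + chunk_size]
        exps = [math.exp(x - phi) for x in chunk]  # independent per chunk
        partial_exps.append(exps)
        partial_sums.append(sum(exps))
    total = sum(partial_sums)  # plain addition, no max synchronization
    return [e / total for exps in partial_exps for e in exps]
```

    Both functions return the same distribution up to floating-point rounding for any choice of `phi` that avoids overflow or underflow.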
    Sanitized Clustering against Confounding Bias. (arXiv:2311.01252v1 [cs.LG])
    Real-world datasets inevitably contain biases that arise from different sources or conditions during data collection. Consequently, such inconsistency itself acts as a confounding factor that disturbs the cluster analysis. Existing methods eliminate the biases by projecting data onto the orthogonal complement of the subspace spanned by the confounding factor before clustering. Therein, the interested clustering factor and the confounding factor are coarsely considered in the raw feature space, where the correlation between the data and the confounding factor is ideally assumed to be linear for convenient solutions. These approaches are thus limited in scope as the data in real applications is usually complex and non-linearly correlated with the confounding factor. This paper presents a new clustering framework named Sanitized Clustering Against confounding Bias (SCAB), which removes the confounding factor in the semantic latent space of complex data through a non-linear dependence measure. To be specific, we eliminate the bias information in the latent space by minimizing the mutual information between the confounding factor and the latent representation delivered by Variational Auto-Encoder (VAE). Meanwhile, a clustering module is introduced to cluster over the purified latent representations. Extensive experiments on complex datasets demonstrate that our SCAB achieves a significant gain in clustering performance by removing the confounding bias. The code is available at https://github.com/EvaFlower/SCAB.
    Boosting Adversarial Transferability by Achieving Flat Local Maxima. (arXiv:2306.05225v2 [cs.CV] UPDATED)
    Transfer-based attacks adopt adversarial examples generated on a surrogate model to attack various models, making them applicable in the physical world and attracting increasing interest. Recently, various adversarial attacks have emerged to boost adversarial transferability from different perspectives. In this work, inspired by the observation that flat local minima are correlated with good generalization, we assume and empirically validate that adversarial examples at a flat local region tend to have good transferability by introducing a penalized gradient norm to the original loss function. Since directly optimizing the gradient regularization norm is computationally expensive and intractable for generating adversarial examples, we propose an approximation optimization method to simplify the gradient update of the objective function. Specifically, we randomly sample an example and adopt a first-order procedure to approximate the curvature of Hessian/vector product, which makes computing more efficient by interpolating two neighboring gradients. Meanwhile, in order to obtain a more stable gradient direction, we randomly sample multiple examples and average the gradients of these examples to reduce the variance due to random sampling during the iterative process. Extensive experimental results on the ImageNet-compatible dataset show that the proposed method can generate adversarial examples at flat local regions, and significantly improves the adversarial transferability on both normally trained and adversarially trained models compared with state-of-the-art attacks. Our code is available at: https://github.com/Trustworthy-AI-Group/PGN.
    The Re-Label Method For Data-Centric Machine Learning. (arXiv:2302.04391v6 [cs.LG] UPDATED)
    In industrial deep learning applications, manually labeled data contains a certain amount of noisy data. To solve this problem and achieve a score above 90 on the dev dataset, we present a simple method to find the noisy data and have humans re-label it, using the model predictions as references during human labeling. In this paper, we illustrate our idea for a broad set of deep learning tasks, including classification, sequence tagging, object detection, sequence generation, and click-through rate prediction. The dev dataset evaluation results and human evaluation results verify our idea.
    Re-weighting Tokens: A Simple and Effective Active Learning Strategy for Named Entity Recognition. (arXiv:2311.00906v1 [cs.CL])
    Active learning, a widely adopted technique for enhancing machine learning models in text and image classification tasks with limited annotation resources, has received relatively little attention in the domain of Named Entity Recognition (NER). The challenge of data imbalance in NER has hindered the effectiveness of active learning, as sequence labellers lack sufficient learning signals. To address these challenges, this paper presents a novel reweighting-based active learning strategy that assigns dynamic smoothed weights to individual tokens. This adaptable strategy is compatible with various token-level acquisition functions and contributes to the development of robust active learners. Experimental results on multiple corpora demonstrate the substantial performance improvement achieved by incorporating our re-weighting strategy into existing acquisition functions, validating its practical efficacy.
    Real-Time Magnetic Tracking and Diagnosis of COVID-19 via Machine Learning. (arXiv:2311.00737v1 [cs.LG])
    The COVID-19 pandemic underscored the importance of reliable, noninvasive diagnostic tools for robust public health interventions. In this work, we fused magnetic respiratory sensing technology (MRST) with machine learning (ML) to create a diagnostic platform for real-time tracking and diagnosis of COVID-19 and other respiratory diseases. The MRST precisely captures breathing patterns through three specific breath testing protocols: normal breath, holding breath, and deep breath. We collected breath data from both COVID-19 patients and healthy subjects in Vietnam using this platform, which then served to train and validate ML models. Our evaluation encompassed multiple ML algorithms, including support vector machines and deep learning models, assessing their ability to diagnose COVID-19. Our multi-model validation methodology ensures a thorough comparison and makes it possible to select the best model, striking a balance between diagnostic precision and model interpretability. The findings highlight the exceptional potential of our diagnostic tool in pinpointing respiratory anomalies, achieving over 90% accuracy. This innovative sensor technology can be seamlessly integrated into healthcare settings for patient monitoring, marking a significant enhancement for the healthcare infrastructure.
    Beyond Ensemble Averages: Leveraging Climate Model Ensembles for Subseasonal Forecasting. (arXiv:2211.15856v2 [cs.LG] UPDATED)
    Producing high-quality forecasts of key climate variables such as temperature and precipitation on subseasonal time scales has long been a gap in operational forecasting. Recent studies have shown promising results using machine learning (ML) models to advance subseasonal forecasting (SSF), but several open questions remain. First, several past approaches use the average of an ensemble of physics-based forecasts as an input feature of these models. However, ensemble forecasts contain information that can aid prediction beyond only the ensemble mean. Second, past methods have focused on average performance, whereas forecasts of extreme events are far more important for planning and mitigation purposes. Third, climate forecasts correspond to a spatially-varying collection of forecasts, and different methods account for spatial variability in the response differently. Trade-offs between different approaches may be mitigated with model stacking. This paper describes the application of a variety of ML methods used to predict monthly average precipitation and two meter temperature using physics-based predictions (ensemble forecasts) and observational data such as relative humidity, pressure at sea level, or geopotential height, two weeks in advance for the whole continental United States. Regression, quantile regression, and tercile classification tasks using linear models, random forests, convolutional neural networks, and stacked models are considered. The proposed models outperform common baselines such as historical averages (or quantiles) and ensemble averages (or quantiles). This paper further includes an investigation of feature importance, trade-offs between using the full ensemble or only the ensemble average, and different modes of accounting for spatial variability.
    DP-Mix: Mixup-based Data Augmentation for Differentially Private Learning. (arXiv:2311.01295v1 [cs.LG])
    Data augmentation techniques, such as simple image transformations and combinations, are highly effective at improving the generalization of computer vision models, especially when training data is limited. However, such techniques are fundamentally incompatible with differentially private learning approaches, due to the latter's built-in assumption that each training image's contribution to the learned model is bounded. In this paper, we investigate why naive applications of multi-sample data augmentation techniques, such as mixup, fail to achieve good performance and propose two novel data augmentation techniques specifically designed for the constraints of differentially private learning. Our first technique, DP-Mix_Self, achieves SoTA classification performance across a range of datasets and settings by performing mixup on self-augmented data. Our second technique, DP-Mix_Diff, further improves performance by incorporating synthetic data from a pre-trained diffusion model into the mixup process. We open-source the code at https://github.com/wenxuan-Bao/DP-Mix.
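    For readers unfamiliar with mixup, the interpolation it performs can be sketched in a few lines. This shows only the standard mixup step (a convex combination of two inputs and their labels); it does not include any differentially private training, and it is not the DP-Mix_Self or DP-Mix_Diff procedure. The `Beta(0.4, 0.4)` mixing distribution and the toy vectors are assumptions for illustration.

```python
import random


def mixup(x_a, x_b, y_a, y_b, lam):
    """Convex combination of two inputs and of their (one-hot) labels."""
    x_mix = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y_mix = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x_mix, y_mix


# Hypothetical example: two flattened "images" with one-hot labels.
rng = random.Random(0)
lam = rng.betavariate(0.4, 0.4)  # mixing coefficient in [0, 1]
x_mix, y_mix = mixup([0.0, 1.0, 0.5], [1.0, 0.0, 0.5], [1.0, 0.0], [0.0, 1.0], lam)
```

    When both inputs are augmentations of the same training image, as the abstract describes for the self-augmented variant, the two labels coincide and only the inputs are blended.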
    SatBird: Bird Species Distribution Modeling with Remote Sensing and Citizen Science Data. (arXiv:2311.00936v1 [cs.LG])
    Biodiversity is declining at an unprecedented rate, impacting ecosystem services necessary to ensure food, water, and human health and well-being. Understanding the distribution of species and their habitats is crucial for conservation policy planning. However, traditional methods in ecology for species distribution models (SDMs) generally focus either on narrow sets of species or narrow geographical areas and there remain significant knowledge gaps about the distribution of species. A major reason for this is the limited availability of data traditionally used, due to the prohibitive amount of effort and expertise required for traditional field monitoring. The wide availability of remote sensing data and the growing adoption of citizen science tools to collect species observations data at low cost offer an opportunity for improving biodiversity monitoring and enabling the modelling of complex ecosystems. We introduce a novel task for mapping bird species to their habitats by predicting species encounter rates from satellite images, and present SatBird, a satellite dataset of locations in the USA with labels derived from presence-absence observation data from the citizen science database eBird, considering summer (breeding) and winter seasons. We also provide a dataset in Kenya representing low-data regimes. We additionally provide environmental data and species range maps for each location. We benchmark a set of baselines on our dataset, including SOTA models for remote sensing tasks. SatBird opens up possibilities for scalably modelling properties of ecosystems worldwide.
    Gradient-free online learning of subgrid-scale dynamics with neural emulators. (arXiv:2310.19385v2 [physics.comp-ph] UPDATED)
    In this paper, we propose a generic algorithm to train machine learning-based subgrid parametrizations online, i.e., with $\textit{a posteriori}$ loss functions for non-differentiable numerical solvers. The proposed approach leverages neural emulators to train an approximation of the reduced state-space solver, which is then used to allow gradient propagation through temporal integration steps. The algorithm is able to recover most of the benefit of online strategies without having to compute the gradient of the original solver. It is demonstrated that training the neural emulator and parametrization components separately, each with its own loss, is necessary in order to minimize the propagation of some approximation bias.
    Additive Decoders for Latent Variables Identification and Cartesian-Product Extrapolation. (arXiv:2307.02598v2 [cs.LG] UPDATED)
    We tackle the problems of latent variables identification and "out-of-support" image generation in representation learning. We show that both are possible for a class of decoders that we call additive, which are reminiscent of decoders used for object-centric representation learning (OCRL) and well suited for images that can be decomposed as a sum of object-specific images. We provide conditions under which exactly solving the reconstruction problem using an additive decoder is guaranteed to identify the blocks of latent variables up to permutation and block-wise invertible transformations. This guarantee relies only on very weak assumptions about the distribution of the latent factors, which might present statistical dependencies and have an almost arbitrarily shaped support. Our result provides a new setting where nonlinear independent component analysis (ICA) is possible and adds to our theoretical understanding of OCRL methods. We also show theoretically that additive decoders can generate novel images by recombining observed factors of variations in novel ways, an ability we refer to as Cartesian-product extrapolation. We show empirically that additivity is crucial for both identifiability and extrapolation on simulated data.
    Adapt On-the-Go: Behavior Modulation for Single-Life Robot Deployment. (arXiv:2311.01059v1 [cs.RO])
    To succeed in the real world, robots must cope with situations that differ from those seen during training. We study the problem of adapting on-the-fly to such novel scenarios during deployment, by drawing upon a diverse repertoire of previously learned behaviors. Our approach, RObust Autonomous Modulation (ROAM), introduces a mechanism based on the perceived value of pre-trained behaviors to select and adapt pre-trained behaviors to the situation at hand. Crucially, this adaptation process all happens within a single episode at test time, without any human supervision. We provide theoretical analysis of our selection mechanism and demonstrate that ROAM enables a robot to adapt rapidly to changes in dynamics both in simulation and on a real Go1 quadruped, even successfully moving forward with roller skates on its feet. Our approach adapts over 2x as efficiently compared to existing methods when facing a variety of out-of-distribution situations during deployment by effectively choosing and adapting relevant behaviors on-the-fly.
    Transparent Anomaly Detection via Concept-based Explanations. (arXiv:2310.10702v2 [cs.LG] UPDATED)
    Advancements in deep learning techniques have given a boost to the performance of anomaly detection. However, real-world and safety-critical applications demand a level of transparency and reasoning beyond accuracy. The task of anomaly detection (AD) focuses on finding whether a given sample follows the learned distribution. Existing methods lack the ability to reason with clear explanations for their outcomes. To overcome this challenge, we propose Transparent Anomaly Detection Concept Explanations (ACE). ACE is able to provide human-interpretable explanations in the form of concepts along with anomaly prediction. To the best of our knowledge, this is the first paper that proposes interpretable-by-design anomaly detection. In addition to promoting transparency in AD, it allows for effective human-model interaction. Our proposed model shows either higher or comparable results to black-box uninterpretable models. We validate the performance of ACE across three realistic datasets: bird classification on CUB-200-2011, challenging histopathology slide image classification on TIL-WSI-TCGA, and gender classification on CelebA. We further demonstrate that our concept learning paradigm can be seamlessly integrated with other classification-based AD methods.
    On Finding Bi-objective Pareto-optimal Fraud Prevention Rule Sets for Fintech Applications. (arXiv:2311.00964v1 [cs.LG])
    Rules are widely used in Fintech institutions to make fraud prevention decisions, since rules are highly interpretable thanks to their intuitive if-then structure. In practice, a two-stage framework of fraud prevention decision rule set mining is usually employed in large Fintech institutions. This paper is concerned with finding high-quality rule subsets in a bi-objective space (such as precision and recall) from an initial pool of rules. To this end, we adopt the concept of Pareto optimality and aim to find a set of non-dominated rule subsets, which constitutes a Pareto front. We propose a heuristic-based framework called PORS and we identify that the core of PORS is the problem of solution selection on the front (SSF). We provide a systematic categorization of the SSF problem and a thorough empirical evaluation of various SSF methods on both public and proprietary datasets. We also introduce a novel variant of sequential covering algorithm called SpectralRules to encourage the diversity of the initial rule set and we empirically find that SpectralRules further improves the quality of the found Pareto front. On two real application scenarios within Alipay, we demonstrate the advantages of our proposed methodology compared to existing work.
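    The notion of a Pareto front over two objectives such as precision and recall can be made concrete with a small non-dominated filter. This is a minimal sketch of Pareto optimality only, not the PORS framework or any SSF method from the paper; the candidate scores and the function name are hypothetical.

```python
def pareto_front(points):
    """Return the (precision, recall) pairs not dominated by any other pair.

    A pair q dominates p when q is at least as good on both objectives
    and strictly better on at least one (both objectives are maximized).
    """
    front = []
    for i, p in enumerate(points):
        dominated = any(
            q[0] >= p[0] and q[1] >= p[1] and (q[0] > p[0] or q[1] > p[1])
            for j, q in enumerate(points)
            if j != i
        )
        if not dominated:
            front.append(p)
    return front


# Hypothetical (precision, recall) scores of five candidate rule subsets.
candidates = [(0.90, 0.20), (0.80, 0.50), (0.70, 0.45), (0.60, 0.70), (0.60, 0.60)]
front = pareto_front(candidates)
```

    Here the subsets scoring (0.70, 0.45) and (0.60, 0.60) are dropped because another subset is at least as good on both objectives; choosing among the three remaining trade-offs is the solution-selection step the abstract refers to.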
    Low-latency Real-time Voice Conversion on CPU. (arXiv:2311.00873v1 [cs.SD])
    We adapt the architectures of previous audio manipulation and generation neural networks to the task of real-time any-to-one voice conversion. Our resulting model, LLVC ($\textbf{L}$ow-latency $\textbf{L}$ow-resource $\textbf{V}$oice $\textbf{C}$onversion), has a latency of under 20ms at a bitrate of 16kHz and runs nearly 2.8x faster than real-time on a consumer CPU. LLVC uses both a generative adversarial architecture and knowledge distillation in order to attain this performance. To our knowledge, LLVC achieves both the lowest resource usage and the lowest latency of any open-source voice conversion model. We provide open-source samples, code, and pretrained model weights at https://github.com/KoeAI/LLVC.
    Identifying Alzheimer Disease Dementia Levels Using Machine Learning Methods. (arXiv:2311.01428v1 [cs.LG])
    Dementia, a prevalent neurodegenerative condition, is a major manifestation of Alzheimer's disease (AD). As the condition progresses from mild to severe, it significantly impairs the individual's ability to perform daily tasks independently, necessitating timely and accurate AD classification. Machine learning and deep learning models have emerged as effective tools for this purpose. In this study, we propose an approach for classifying the four stages of dementia using RF, SVM, and CNN algorithms, augmented with watershed segmentation for feature extraction from MRI images. Our results reveal that SVM with watershed features achieves an accuracy of 96.25%, surpassing the other classification methods. The ADNI dataset is used to evaluate the effectiveness of our method, and we observe that the inclusion of watershed segmentation contributes to the enhanced performance of the models.
    Textually Pretrained Speech Language Models. (arXiv:2305.13009v2 [cs.CL] UPDATED)
    Speech language models (SpeechLMs) process and generate acoustic data only, without textual supervision. In this work, we propose TWIST, a method for training SpeechLMs using a warm-start from a pretrained textual language model. We show using both automatic and human evaluations that TWIST outperforms a cold-start SpeechLM across the board. We empirically analyze the effect of different model design choices such as the speech tokenizer, the pretrained textual model, and the dataset size. We find that model and dataset scale both play an important role in constructing better-performing SpeechLMs. Based on our observations, we present the largest (to the best of our knowledge) SpeechLM both in terms of number of parameters and training data. We additionally introduce two spoken versions of the StoryCloze textual benchmark to further improve model evaluation and advance future research in the field. We make speech samples, code and models publicly available: https://pages.cs.huji.ac.il/adiyoss-lab/twist/ .
    Deep learning based Image Compression for Microscopy Images: An Empirical Study. (arXiv:2311.01352v1 [eess.IV])
    With the fast development of modern microscopes and bioimaging techniques, an unprecedentedly large amount of imaging data is being generated, stored, analyzed, and even shared through networks. The size of the data poses great challenges for current data infrastructure. One common way to reduce the data size is image compression. The present study analyzes classic and deep learning based image compression methods, and their impact on deep learning based image processing models. Deep learning based label-free prediction models (i.e., predicting fluorescent images from bright field images) are used as an example application for comparison and analysis. Effective image compression methods could help reduce the data size significantly without losing necessary information, and therefore reduce the burden on data management infrastructure and permit fast transmission through the network for data sharing or cloud computing. To this end, multiple classical lossy image compression techniques are compared to several AI-based compression models provided by and trained with the CompressAI toolbox in Python. These compression techniques are compared in terms of compression ratio, multiple image similarity measures and, most importantly, the prediction accuracy of label-free models on compressed images. We found that AI-based compression techniques largely outperform the classic ones and minimally affect the downstream label-free task in 2D cases. We hope the present study sheds light on the potential of deep learning based image compression and on the impact of image compression on downstream deep learning based image analysis models.
    Exploration noise for learning linear-quadratic mean field games. (arXiv:2107.00839v2 [math.OC] UPDATED)
    The goal of this paper is to demonstrate that common noise may serve as an exploration noise for learning the solution of a mean field game. This concept is here exemplified through a toy linear-quadratic model, for which a suitable form of common noise has already been proven to restore existence and uniqueness. We here go one step further and prove that the same form of common noise may force the convergence of the learning algorithm called "fictitious play", and this without any further potential or monotone structure. Several numerical examples are provided in order to support our theoretical analysis.
    Unreading Race: Purging Protected Features from Chest X-ray Embeddings. (arXiv:2311.01349v1 [cs.LG])
    Purpose: To analyze and remove protected feature effects in chest radiograph embeddings of deep learning models. Materials and Methods: An orthogonalization is utilized to remove the influence of protected features (e.g., age, sex, race) in chest radiograph embeddings, ensuring feature-independent results. To validate the efficacy of the approach, we retrospectively study the MIMIC and CheXpert datasets using three pre-trained models, namely a supervised contrastive, a self-supervised contrastive, and a baseline classifier model. Our statistical analysis involves comparing the original versus the orthogonalized embeddings by estimating protected feature influences and evaluating the ability to predict race, age, or sex using the two types of embeddings. Results: Our experiments reveal a significant influence of protected features on predictions of pathologies. Applying orthogonalization removes these feature effects. Apart from removing any influence on pathology classification, while maintaining competitive predictive performance, orthogonalized embeddings further make it infeasible to directly predict protected attributes and mitigate subgroup disparities. Conclusion: The presented work demonstrates the successful application and evaluation of the orthogonalization technique in the domain of chest X-ray classification.
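    One simple way to picture removing a protected feature's linear influence from embeddings is to regress each embedding dimension on the (centered) feature and keep the residual. The abstract does not spell out the orthogonalization used, so the following is only an assumed, single-feature linear sketch; the function name and the toy numbers are hypothetical.

```python
def orthogonalize(embeddings, protected):
    """Remove the linear effect of one protected feature from every embedding dimension.

    embeddings: list of rows (one embedding per sample)
    protected:  list with one numeric protected-feature value per sample
    """
    n = len(protected)
    mean_z = sum(protected) / n
    z = [p - mean_z for p in protected]  # centered protected feature
    zz = sum(v * v for v in z)
    result = [row[:] for row in embeddings]
    for k in range(len(embeddings[0])):
        col = [row[k] for row in embeddings]
        mean_c = sum(col) / n
        # Least-squares slope of this embedding dimension on the feature.
        beta = sum((c - mean_c) * v for c, v in zip(col, z)) / zz
        for i in range(n):
            result[i][k] = embeddings[i][k] - beta * z[i]
    return result


# Hypothetical toy data: 4 samples, 2-dimensional embeddings, one protected feature.
emb = [[1.0, 0.2], [2.1, 0.1], [2.9, 0.4], [4.2, 0.3]]
age = [30.0, 40.0, 50.0, 60.0]
clean = orthogonalize(emb, age)
```

    After this step each embedding dimension has zero sample covariance with the protected feature, so a linear predictor of that feature from the cleaned embeddings has nothing to work with; nonlinear dependence is not addressed by this sketch.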
    Fraud Analytics Using Machine-learning & Engineering on Big Data (FAME) for Telecom. (arXiv:2311.00724v1 [cs.LG])
    The telecom industry loses 46.3 billion USD globally to fraud. Data mining and machine learning techniques (apart from rule-oriented approaches) have been used in the past, but their efficiency has been low because fraud patterns change very rapidly. This paper presents an industrialized solution approach with a self-adaptive data mining technique and the application of big data technologies to detect fraud and discover novel fraud patterns in an accurate, efficient, and cost-effective manner. The solution has been successfully demonstrated to detect International Revenue Share Fraud with a false positive rate below 5%. More than 1 terabyte of Call Detail Records from a reputed wholesale carrier and overseas telecom transit carrier has been used to conduct this study.
    Dynamic Fair Federated Learning Based on Reinforcement Learning. (arXiv:2311.00959v1 [cs.LG])
    Federated learning enables collaborative training and optimization of global models among a group of devices without sharing local data samples. However, the heterogeneity of data in federated learning can lead to unfair representation of the global model across different devices. To address the fairness issue in federated learning, we propose a dynamic q-fairness federated learning algorithm with reinforcement learning, called DQFFL. DQFFL aims to mitigate the discrepancies in device aggregation and enhance the fairness of treatment for all groups involved in federated learning. To quantify fairness, DQFFL leverages the performance of the global federated model on each device and incorporates $\alpha$-fairness to transform the preservation of fairness during federated aggregation into the distribution of client weights in the aggregation process. Considering the sensitivity of parameters in measuring fairness, we propose to utilize reinforcement learning to set the parameters dynamically during aggregation. Experimental results demonstrate that our DQFFL outperforms the state-of-the-art methods in terms of overall performance, fairness and convergence speed.
    Towards Evaluating Transfer-based Attacks Systematically, Practically, and Fairly. (arXiv:2311.01323v1 [cs.LG])
    The adversarial vulnerability of deep neural networks (DNNs) has drawn great attention due to the security risk of applying these models in real-world applications. Based on the transferability of adversarial examples, an increasing number of transfer-based methods have been developed to fool black-box DNN models whose architecture and parameters are inaccessible. Although tremendous effort has been exerted, a standardized benchmark for comparing these methods systematically, fairly, and practically is still lacking. Our investigation shows that the evaluation of some methods needs to be more reasonable and more thorough to verify their effectiveness, to avoid, for example, unfair comparison and insufficient consideration of possible substitute/victim models. Therefore, we establish a transfer-based attack benchmark (TA-Bench) which implements 30+ methods. In this paper, we evaluate and compare them comprehensively on 25 popular substitute/victim models on ImageNet. New insights about the effectiveness of these methods are gained and guidelines for future evaluations are provided. Code at: https://github.com/qizhangli/TA-Bench.
    Improving Adversarial Transferability via Intermediate-level Perturbation Decay. (arXiv:2304.13410v3 [cs.LG] UPDATED)
    Intermediate-level attacks that attempt to drastically perturb feature representations along an adversarial direction have shown favorable performance in crafting transferable adversarial examples. Existing methods in this category are normally formulated with two separate stages, where a directional guide is determined first and the scalar projection of the intermediate-level perturbation onto the directional guide is enlarged thereafter. The obtained perturbation inevitably deviates from the guide in the feature space, and it is revealed in this paper that such a deviation may lead to a sub-optimal attack. To address this issue, we develop a novel intermediate-level method that crafts adversarial examples within a single stage of optimization. In particular, the proposed method, named intermediate-level perturbation decay (ILPD), encourages the intermediate-level perturbation to be in an effective adversarial direction and to possess a great magnitude simultaneously. In-depth discussion verifies the effectiveness of our method. Experimental results show that it outperforms state-of-the-art methods by large margins in attacking various victim models on ImageNet (+10.07% on average) and CIFAR-10 (+3.88% on average). Our code is at https://github.com/qizhangli/ILPD-attack.
    Holistic Transfer: Towards Non-Disruptive Fine-Tuning with Partial Target Data. (arXiv:2311.01420v1 [cs.LG])
    We propose a learning problem involving adapting a pre-trained source model to the target domain for classifying all classes that appeared in the source data, using target data that covers only a partial label space. This problem is practical, as it is unrealistic for the target end-users to collect data for all classes prior to adaptation. However, it has received limited attention in the literature. To shed light on this issue, we construct benchmark datasets and conduct extensive experiments to uncover the inherent challenges. We found a dilemma -- on the one hand, adapting to the new target domain is important to claim better performance; on the other hand, we observe that preserving the classification accuracy of classes missing in the target adaptation data is highly challenging, let alone improving them. To tackle this, we identify two key directions: 1) disentangling domain gradients from classification gradients, and 2) preserving class relationships. We present several effective solutions that maintain the accuracy of the missing classes and enhance the overall performance, establishing solid baselines for holistic transfer of pre-trained models with partial target data.
    Resilient Multiple Choice Learning: A learned scoring scheme with application to audio scene analysis. (arXiv:2311.01052v1 [stat.ML])
    We introduce Resilient Multiple Choice Learning (rMCL), an extension of the MCL approach for conditional distribution estimation in regression settings where multiple targets may be sampled for each training input. Multiple Choice Learning is a simple framework to tackle multimodal density estimation, using the Winner-Takes-All (WTA) loss for a set of hypotheses. In regression settings, the existing MCL variants focus on merging the hypotheses, thereby eventually sacrificing the diversity of the predictions. In contrast, our method relies on a novel learned scoring scheme underpinned by a mathematical framework based on Voronoi tessellations of the output space, from which we can derive a probabilistic interpretation. After empirically validating rMCL with experiments on synthetic data, we further assess its merits on the sound source localization problem, demonstrating its practical usefulness and the relevance of its interpretation.
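    The Winner-Takes-All loss named in the abstract above only penalizes the hypothesis closest to the target. The sketch below shows that selection for a single target with squared error; it is an assumed minimal form of WTA and does not include rMCL's learned scoring scheme. All names and numbers are hypothetical.

```python
def wta_loss(hypotheses, target):
    """Winner-Takes-All: return the squared error of the closest hypothesis and its index."""
    errors = [
        sum((h_d - t_d) ** 2 for h_d, t_d in zip(h, target))
        for h in hypotheses
    ]
    winner = min(range(len(errors)), key=errors.__getitem__)
    return errors[winner], winner


# Hypothetical example: three 2-D hypotheses and one observed target.
hypotheses = [[0.0, 0.0], [1.0, 1.0], [3.0, -1.0]]
loss, winner = wta_loss(hypotheses, [0.9, 1.2])
```

    Because each target is assigned to its nearest hypothesis, the hypotheses implicitly split the output space into nearest-neighbor (Voronoi) cells, which is the geometric picture the abstract builds on.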
    Scattering Vision Transformer: Spectral Mixing Matters. (arXiv:2311.01310v1 [cs.CV])
    Vision transformers have gained significant attention and achieved state-of-the-art performance in various computer vision tasks, including image classification, instance segmentation, and object detection. However, challenges remain in addressing attention complexity and effectively capturing fine-grained information within images. Existing solutions often resort to down-sampling operations, such as pooling, to reduce computational cost. Unfortunately, such operations are non-invertible and can result in information loss. In this paper, we present a novel approach called Scattering Vision Transformer (SVT) to tackle these challenges. SVT incorporates a spectrally scattering network that enables the capture of intricate image details. SVT overcomes the invertibility issue associated with down-sampling operations by separating low-frequency and high-frequency components. Furthermore, SVT introduces a unique spectral gating network utilizing Einstein multiplication for token and channel mixing, effectively reducing complexity. We show that SVT achieves state-of-the-art performance on the ImageNet dataset with a significant reduction in the number of parameters and FLOPs. SVT shows a 2% improvement over LiTv2 and iFormer. SVT-H-S reaches 84.2% top-1 accuracy, while SVT-H-B reaches 85.2% (state-of-the-art for base versions) and SVT-H-L reaches 85.7% (again state-of-the-art for large versions). SVT also shows comparable results in other vision tasks such as instance segmentation, and outperforms other transformers in transfer learning on standard datasets such as CIFAR10, CIFAR100, Oxford Flower, and Stanford Car. The project page is available at https://badripatro.github.io/svt/.
    Deep Double Descent for Time Series Forecasting: Avoiding Undertrained Models. (arXiv:2311.01442v1 [cs.LG])
    Deep learning models, particularly Transformers, have achieved impressive results in various domains, including time series forecasting. While existing time series literature primarily focuses on model architecture modifications and data augmentation techniques, this paper explores the training schema of deep learning models for time series, that is, how models are trained regardless of their architecture. We perform extensive experiments to investigate the occurrence of deep double descent in several Transformer models trained on public time series data sets. We demonstrate epoch-wise deep double descent and show that overfitting can be reverted by training for more epochs. Leveraging these findings, we achieve state-of-the-art results for long sequence time series forecasting in nearly 70% of the 72 benchmarks tested. This suggests that many models in the literature may possess untapped potential. Additionally, we introduce a taxonomy for classifying training schema modifications, covering data augmentation, model inputs, model targets, time series per model, and computational budget.
    SensorSCAN: Self-Supervised Learning and Deep Clustering for Fault Diagnosis in Chemical Processes. (arXiv:2208.08879v2 [cs.LG] UPDATED)
    Modern industrial facilities generate large volumes of raw sensor data during the production process. This data is used to monitor and control the processes and can be analyzed to detect and predict process abnormalities. Typically, the data has to be annotated by experts in order to be used in predictive modeling. However, manual annotation of large amounts of data can be difficult in industrial settings. In this paper, we propose SensorSCAN, a novel method for unsupervised fault detection and diagnosis, designed for industrial chemical process monitoring. We demonstrate our model's performance on two publicly available datasets of the Tennessee Eastman Process with various faults. The results show that our method significantly outperforms existing approaches (+0.2-0.3 TPR for a fixed FPR) and effectively detects most of the process faults without expert annotation. Moreover, we show that the model fine-tuned on a small fraction of labeled data nearly reaches the performance of a SOTA model trained on the full dataset. We also demonstrate that our method is suitable for real-world applications where the number of faults is not known in advance. The code is available at https://github.com/AIRI-Institute/sensorscan.
    Atlas-Based Interpretable Age Prediction In Whole-Body MR Images. (arXiv:2307.07439v3 [eess.IV] UPDATED)
    Age prediction is an important part of medical assessments and research. It can aid in detecting diseases as well as abnormal ageing by highlighting the discrepancy between chronological and biological age. To gain a comprehensive understanding of age-related changes observed in various body parts, we investigate them on a larger scale by using whole-body 3D images. We utilise the Grad-CAM interpretability method to determine the body areas most predictive of a person's age. We expand our analysis beyond individual subjects by employing registration techniques to generate population-wide interpretability maps. Our findings reveal three primary areas of interest: the spine, the autochthonous back muscles, and the cardiac region, which exhibits the highest importance.
    Offline Imitation from Observation via Primal Wasserstein State Occupancy Matching. (arXiv:2311.01331v1 [cs.LG])
    In real-world scenarios, arbitrary interactions with the environment can often be costly, and actions of expert demonstrations are not always available. To reduce the need for both, Offline Learning from Observations (LfO) is extensively studied, where the agent learns to solve a task with only expert states and \textit{task-agnostic} non-expert state-action pairs. The state-of-the-art DIstribution Correction Estimation (DICE) methods minimize the state occupancy divergence between the learner and expert policies. However, they are limited to either $f$-divergences (KL and $\chi^2$) or Wasserstein distance with Rubinstein duality, the latter of which constrains the underlying distance metric crucial to the performance of Wasserstein-based solutions. To address this problem, we propose Primal Wasserstein DICE (PW-DICE), which minimizes the primal Wasserstein distance between the expert and learner state occupancies with a pessimistic regularizer and leverages a contrastively learned distance as the underlying metric for the Wasserstein distance. Theoretically, we prove that our framework is a generalization of the state-of-the-art, SMODICE, and unifies $f$-divergence and Wasserstein minimization. Empirically, we find that PW-DICE improves upon several state-of-the-art methods on multiple testbeds.
    BiSLS/SPS: Auto-tune Step Sizes for Stable Bi-level Optimization. (arXiv:2305.18666v2 [cs.LG] UPDATED)
    The popularity of bi-level optimization (BO) in deep learning has spurred a growing interest in studying gradient-based BO algorithms. However, existing algorithms involve two coupled learning rates that can be affected by approximation errors when computing hypergradients, making careful fine-tuning necessary to ensure fast convergence. To alleviate this issue, we investigate the use of recently proposed adaptive step-size methods, namely stochastic line search (SLS) and stochastic Polyak step size (SPS), for computing both the upper and lower-level learning rates. First, we revisit the use of SLS and SPS in single-level optimization without the additional interpolation condition that is typically assumed in prior works. For such settings, we investigate new variants of SLS and SPS that improve upon existing suggestions in the literature and are simpler to implement. Importantly, these two variants can be seen as special instances of a general family of methods with an envelope-type step-size. This unified envelope strategy allows for the extension of the algorithms and their convergence guarantees to BO settings. Finally, our extensive experiments demonstrate that the new algorithms, which are available in both SGD and Adam versions, can find large learning rates with minimal tuning and converge faster than corresponding vanilla SGD or Adam BO algorithms that require fine-tuning.
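To make the stochastic Polyak step size mentioned above concrete, here is a minimal single-level sketch. It uses the commonly cited capped rule eta = min((f_i(w) - f_i*) / (c * ||grad f_i(w)||^2), eta_max), not the paper's new variants or its bi-level extension. The names `sps_step_size` and `sgd_with_sps`, the defaults `c=0.5` and `eta_max=1.0`, and the choice f_i* = 0 (valid here because the toy data can be fit exactly) are all assumptions made for illustration.

```python
import random

def sps_step_size(loss, grad_sq_norm, loss_star=0.0, c=0.5, eta_max=1.0):
    """Capped stochastic Polyak step size (illustrative form, not the paper's variant)."""
    if grad_sq_norm == 0.0:
        return 0.0  # already at a stationary point of this sample's loss
    return min((loss - loss_star) / (c * grad_sq_norm), eta_max)

def sgd_with_sps(data, w=0.0, epochs=20, seed=0):
    """Fit y = w * x by SGD on the per-sample loss 0.5 * (w * x - y) ** 2."""
    rng = random.Random(seed)
    for _ in range(epochs):
        samples = data[:]
        rng.shuffle(samples)
        for x, y in samples:
            residual = w * x - y
            loss = 0.5 * residual ** 2
            grad = residual * x
            # The step size is computed from the current sample only; no tuning loop.
            eta = sps_step_size(loss, grad ** 2)
            w -= eta * grad
    return w

# Toy data generated with slope 3, so every per-sample loss can reach zero.
data = [(x, 3.0 * x) for x in (0.5, 1.0, 1.5, 2.0)]
w_hat = sgd_with_sps(data)
```

The cap `eta_max` matters for the sample with small `x`, where the uncapped Polyak step would be large; for the other samples the step lands on the per-sample minimizer directly.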
    Learning Unsupervised World Models for Autonomous Driving via Discrete Diffusion. (arXiv:2311.01017v1 [cs.CV])
    Learning world models can teach an agent how the world works in an unsupervised manner. Even though it can be viewed as a special case of sequence modeling, progress for scaling world models on robotic applications such as autonomous driving has been somewhat less rapid than scaling language models with Generative Pre-trained Transformers (GPT). We identify two reasons as major bottlenecks: dealing with complex and unstructured observation space, and having a scalable generative model. Consequently, we propose a novel world modeling approach that first tokenizes sensor observations with VQVAE, then predicts the future via discrete diffusion. To efficiently decode and denoise tokens in parallel, we recast Masked Generative Image Transformer into the discrete diffusion framework with a few simple changes, resulting in notable improvement. When applied to learning world models on point cloud observations, our model reduces prior SOTA Chamfer distance by more than 65% for 1s prediction, and more than 50% for 3s prediction, across NuScenes, KITTI Odometry, and Argoverse2 datasets. Our results demonstrate that discrete diffusion on tokenized agent experience can unlock the power of GPT-like unsupervised learning for robotic agents.
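The Chamfer distance used as the point-cloud metric above can be sketched as follows. Conventions differ between papers (squared versus unsquared distances, sums versus means), so this is one common symmetric form and not necessarily the exact variant behind the reported numbers; the function name `chamfer_distance` is hypothetical.

```python
import math

def chamfer_distance(cloud_a, cloud_b):
    """Symmetric Chamfer distance: average nearest-neighbour distance in both directions."""
    def mean_nearest(src, dst):
        # For each source point, distance to its closest point in the other cloud.
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)
    return 0.5 * (mean_nearest(cloud_a, cloud_b) + mean_nearest(cloud_b, cloud_a))

predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
observed = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]
score = chamfer_distance(predicted, observed)
```

This brute-force version costs O(|A| * |B|) distance evaluations; real LiDAR-scale clouds would normally use a spatial index instead.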
    AeroPath: An airway segmentation benchmark dataset with challenging pathology. (arXiv:2311.01138v1 [cs.CV])
    To improve the prognosis of patients suffering from pulmonary diseases, such as lung cancer, early diagnosis and treatment are crucial. The analysis of CT images is invaluable for diagnosis, whereas high-quality segmentations of the airway tree are required for intervention planning and live guidance during bronchoscopy. Recently, the Multi-domain Airway Tree Modeling (ATM'22) challenge released a large dataset, both enabling training of deep-learning-based models and bringing substantial improvement of the state-of-the-art for the airway segmentation task. However, the ATM'22 dataset includes few patients with severe pathologies affecting the airway tree anatomy. In this study, we first introduce a new public benchmark dataset (AeroPath), consisting of 27 CT images from patients with pathologies ranging from emphysema to large tumors, with corresponding trachea and bronchi annotations. Second, we present a multiscale fusion design for automatic airway segmentation. Models were trained on the ATM'22 dataset, tested on the AeroPath dataset, and further evaluated against competitive open-source methods. The same performance metrics as used in the ATM'22 challenge were used to benchmark the different considered approaches. Lastly, an open web application is developed, to easily test the proposed model on new data. The results demonstrated that our proposed architecture predicted topologically correct segmentations for all the patients included in the AeroPath dataset. The proposed method is robust and able to handle various anomalies, down to at least the fifth airway generation. In addition, the AeroPath dataset, featuring patients with challenging pathologies, will contribute to the development of new state-of-the-art methods. The AeroPath dataset and the web application are made openly available.
    In Defense of Softmax Parametrization for Calibrated and Consistent Learning to Defer. (arXiv:2311.01106v1 [cs.LG])
    Enabling machine learning classifiers to defer their decision to a downstream expert when the expert is more accurate will ensure improved safety and performance. This objective can be achieved with the learning-to-defer framework which aims to jointly learn how to classify and how to defer to the expert. In recent studies, it has been theoretically shown that popular estimators for learning to defer parameterized with softmax provide unbounded estimates for the likelihood of deferring which makes them uncalibrated. However, it remains unknown whether this is due to the widely used softmax parameterization and if we can find a softmax-based estimator that is both statistically consistent and possesses a valid probability estimator. In this work, we first show that the cause of the miscalibrated and unbounded estimator in prior literature is due to the symmetric nature of the surrogate losses used and not due to softmax. We then propose a novel statistically consistent asymmetric softmax-based surrogate loss that can produce valid estimates without the issue of unboundedness. We further analyze the non-asymptotic properties of our method and empirically validate its performance and calibration on benchmark datasets.
    Sequence Modeling with Multiresolution Convolutional Memory. (arXiv:2305.01638v2 [cs.LG] UPDATED)
    Efficiently capturing the long-range patterns in sequential data sources salient to a given task -- such as classification and generative modeling -- poses a fundamental challenge. Popular approaches in the space trade off between the memory burden of brute-force enumeration and comparison, as in transformers, the computational burden of complicated sequential dependencies, as in recurrent neural networks, or the parameter burden of convolutional networks with many or large filters. We instead take inspiration from wavelet-based multiresolution analysis to define a new building block for sequence modeling, which we call a MultiresLayer. The key component of our model is the multiresolution convolution, capturing multiscale trends in the input sequence. Our MultiresConv can be implemented with shared filters across a dilated causal convolution tree. Thus it garners the computational advantages of convolutional networks and the principled theoretical motivation of wavelet decompositions. Our MultiresLayer is straightforward to implement, requires significantly fewer parameters, and maintains at most an $\mathcal{O}(N\log N)$ memory footprint for a length $N$ sequence. Yet, by stacking such layers, our model yields state-of-the-art performance on a number of sequence classification and autoregressive density estimation tasks using CIFAR-10, ListOps, and PTB-XL datasets.
    Learning Realistic Traffic Agents in Closed-loop. (arXiv:2311.01394v1 [cs.RO])
    Realistic traffic simulation is crucial for developing self-driving software in a safe and scalable manner prior to real-world deployment. Typically, imitation learning (IL) is used to learn human-like traffic agents directly from real-world observations collected offline, but without explicit specification of traffic rules, agents trained from IL alone frequently display unrealistic infractions like collisions and driving off the road. This problem is exacerbated in out-of-distribution and long-tail scenarios. On the other hand, reinforcement learning (RL) can train traffic agents to avoid infractions, but using RL alone results in unhuman-like driving behaviors. We propose Reinforcing Traffic Rules (RTR), a holistic closed-loop learning objective to match expert demonstrations under a traffic compliance constraint, which naturally gives rise to a joint IL + RL approach, obtaining the best of both worlds. Our method learns in closed-loop simulations of both nominal scenarios from real-world datasets as well as procedurally generated long-tail scenarios. Our experiments show that RTR learns more realistic and generalizable traffic simulation policies, achieving significantly better tradeoffs between human-like driving and traffic compliance in both nominal and long-tail scenarios. Moreover, when used as a data generation tool for training prediction models, our learned traffic policy leads to considerably improved downstream prediction metrics compared to baseline traffic agents. For more information, visit the project website: https://waabi.ai/rtr
    Empathy Detection Using Machine Learning on Text, Audiovisual, Audio or Physiological Signals. (arXiv:2311.00721v1 [cs.HC])
    Empathy is a social skill that indicates an individual's ability to understand others. Over the past few years, empathy has drawn attention from various disciplines, including but not limited to Affective Computing, Cognitive Science and Psychology. Empathy is a context-dependent term; thus, detecting or recognising empathy has potential applications in society, healthcare and education. Despite being a broad and overlapping topic, the avenue of empathy detection studies leveraging Machine Learning remains underexplored from a holistic literature perspective. To this end, we systematically collect and screen 801 papers from 10 well-known databases and analyse the selected 54 papers. We group the papers based on input modalities of empathy detection systems, i.e., text, audiovisual, audio and physiological signals. We examine modality-specific pre-processing and network architecture design protocols, popular dataset descriptions and availability details, and evaluation protocols. We further discuss the potential applications, deployment challenges and research gaps in the Affective Computing-based empathy domain, which can facilitate new avenues of exploration. We believe that our work is a stepping stone to developing a privacy-preserving and unbiased empathic system inclusive of culture, diversity and multilingualism that can be deployed in practice to enhance the overall well-being of human life.
    A Deep Learning algorithm to accelerate Algebraic Multigrid methods in Finite Element solvers of 3D elliptic PDEs. (arXiv:2304.10832v3 [math.NA] UPDATED)
    Algebraic multigrid (AMG) methods are among the most efficient solvers for linear systems of equations and they are widely used for the solution of problems stemming from the discretization of Partial Differential Equations (PDEs). The most severe limitation of AMG methods is the dependence on parameters that must be fine-tuned. In particular, the strong threshold parameter is the most relevant since it stands at the basis of the construction of successively coarser grids needed by the AMG methods. We introduce a novel Deep Learning algorithm that minimizes the computational cost of the AMG method when used as a finite element solver. We show that our algorithm requires minimal changes to any existing code. The proposed Artificial Neural Network (ANN) tunes the value of the strong threshold parameter by interpreting the sparse matrix of the linear system as a black-and-white image and exploiting a pooling operator to transform it into a small multi-channel image. We experimentally prove that the pooling successfully reduces the computational cost of processing a large sparse matrix and preserves the features needed for the regression task at hand. We train the proposed algorithm on a large dataset containing problems with a highly heterogeneous diffusion coefficient defined in different three-dimensional geometries and discretized with unstructured grids, as well as linear elasticity problems with a highly heterogeneous Young's modulus. When tested on problems with coefficients or geometries not present in the training dataset, our approach reduces the computational time by up to 30%.
    Diable: Efficient Dialogue State Tracking as Operations on Tables. (arXiv:2305.17020v3 [cs.CL] UPDATED)
    Sequence-to-sequence state-of-the-art systems for dialogue state tracking (DST) use the full dialogue history as input, represent the current state as a list with all the slots, and generate the entire state from scratch at each dialogue turn. This approach is inefficient, especially when the number of slots is large and the conversation is long. We propose Diable, a new task formalisation that simplifies the design and implementation of efficient DST systems and allows one to easily plug and play large language models. We represent the dialogue state as a table and formalise DST as a table manipulation task. At each turn, the system updates the previous state by generating table operations based on the dialogue context. Extensive experimentation on the MultiWoz datasets demonstrates that Diable (i) outperforms strong efficient DST baselines, (ii) is 2.4x more time efficient than current state-of-the-art methods while retaining competitive Joint Goal Accuracy, and (iii) is robust to noisy data annotations due to the table operations approach.
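The idea of updating a dialogue state through table operations, instead of regenerating the whole state each turn, can be illustrated with a small sketch. The operation names `INSERT` and `DELETE`, the `(operation, slot, value)` tuple format, and the slot names are assumptions made for this example and are not taken from the paper's implementation.

```python
def apply_operations(state, operations):
    """Return a new state table after applying (operation, slot, value) tuples in order."""
    new_state = dict(state)  # copy so the previous turn's state is left untouched
    for op, slot, value in operations:
        if op == "INSERT":
            new_state[slot] = value      # add a new slot or overwrite its value
        elif op == "DELETE":
            new_state.pop(slot, None)    # drop the slot if it is present
        else:
            raise ValueError(f"unknown operation: {op}")
    return new_state

# State carried over from the previous turn.
previous_state = {"hotel-area": "north"}

# Operations a model might emit for the current turn (hypothetical output).
turn_operations = [
    ("INSERT", "hotel-stars", "4"),
    ("DELETE", "hotel-area", "north"),
    ("INSERT", "hotel-area", "centre"),
]
current_state = apply_operations(previous_state, turn_operations)
```

Because only the changes are generated, the output length per turn depends on how much the state changed and not on the total number of slots.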
    Can Large Language Models Design Accurate Label Functions? (arXiv:2311.00739v1 [cs.CL])
    Programmatic weak supervision methodologies facilitate the expedited labeling of extensive datasets through the use of label functions (LFs) that encapsulate heuristic data sources. Nonetheless, the creation of precise LFs necessitates domain expertise and substantial endeavors. Recent advances in pre-trained language models (PLMs) have exhibited substantial potential across diverse tasks. However, the capacity of PLMs to autonomously formulate accurate LFs remains an underexplored domain. In this research, we address this gap by introducing DataSculpt, an interactive framework that harnesses PLMs for the automated generation of LFs. Within DataSculpt, we incorporate an array of prompting techniques, instance selection strategies, and LF filtration methods to explore the expansive design landscape. Ultimately, we conduct a thorough assessment of DataSculpt's performance on 12 real-world datasets, encompassing a range of tasks. This evaluation unveils both the strengths and limitations of contemporary PLMs in LF design.
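For readers unfamiliar with label functions, the sketch below shows the general shape of programmatic weak supervision: several noisy heuristics that either vote for a class or abstain, combined here by a simple majority vote. The three heuristics, the label constants, and the tie-breaking rule are hypothetical and written by hand; they are not LFs produced by DataSculpt, and practical systems often use a learned label model instead of majority voting.

```python
from collections import Counter

ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1

def lf_praise(text):
    """Vote POSITIVE when a praise keyword appears."""
    return POSITIVE if any(w in text.lower() for w in ("great", "excellent")) else ABSTAIN

def lf_complaint(text):
    """Vote NEGATIVE when a complaint keyword appears."""
    return NEGATIVE if any(w in text.lower() for w in ("terrible", "refund")) else ABSTAIN

def lf_exclaim(text):
    """Weak signal: treat a trailing exclamation mark as POSITIVE."""
    return POSITIVE if text.strip().endswith("!") else ABSTAIN

LABEL_FUNCTIONS = [lf_praise, lf_complaint, lf_exclaim]

def majority_vote(text, lfs=LABEL_FUNCTIONS):
    """Combine LF votes; abstain when no LF fires or the top two labels tie."""
    votes = [v for v in (lf(text) for lf in lfs) if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]
```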
    Gaussian Processes on Cellular Complexes. (arXiv:2311.01198v1 [cs.LG])
    In recent years, there has been considerable interest in developing machine learning models on graphs in order to account for topological inductive biases. In particular, recent attention was given to Gaussian processes on such structures since they can additionally account for uncertainty. However, graphs are limited to modelling relations between two vertices. In this paper, we go beyond this dyadic setting and consider polyadic relations that include interactions between vertices, edges and one of their generalisations, known as cells. Specifically, we propose Gaussian processes on cellular complexes, a generalisation of graphs that captures interactions between these higher-order cells. One of our key contributions is the derivation of two novel kernels, one that generalises the graph Mat\'ern kernel and one that additionally mixes information of different cell types.
    Representation Equivalent Neural Operators: a Framework for Alias-free Operator Learning. (arXiv:2305.19913v2 [cs.LG] UPDATED)
    Recently, operator learning, or learning mappings between infinite-dimensional function spaces, has garnered significant attention, notably in relation to learning partial differential equations from data. Conceptually clear when outlined on paper, neural operators necessitate discretization in the transition to computer implementations. This step can compromise their integrity, often causing them to deviate from the underlying operators. This research offers a fresh take on neural operators with a framework, Representation equivalent Neural Operators (ReNO), designed to address these issues. At its core is the concept of operator aliasing, which measures inconsistency between neural operators and their discrete representations. We explore this for widely-used operator learning techniques. Our findings detail how aliasing introduces errors when handling different discretizations and grids and loss of crucial continuous structures. More generally, this framework not only sheds light on existing challenges but, given its constructive and broad nature, also potentially offers tools for developing new neural operators.
    Gaussian Process Priors for Systems of Linear Partial Differential Equations with Constant Coefficients. (arXiv:2212.14319v4 [stat.ML] UPDATED)
    Partial differential equations (PDEs) are important tools to model physical systems and including them in machine learning models is an important way of incorporating physical knowledge. Given any system of linear PDEs with constant coefficients, we propose a family of Gaussian process (GP) priors, which we call EPGP, such that all realizations are exact solutions of this system. We apply the Ehrenpreis-Palamodov fundamental principle, which works as a non-linear Fourier transform, to construct GP kernels mirroring standard spectral methods for GPs. Our approach can infer probable solutions of linear PDE systems from any data such as noisy measurements, or pointwise defined initial and boundary conditions. Constructing EPGP-priors is algorithmic, generally applicable, and comes with a sparse version (S-EPGP) that learns the relevant spectral frequencies and works better for big data sets. We demonstrate our approach on three families of systems of PDEs, the heat equation, wave equation, and Maxwell's equations, where we improve upon the state of the art in computation time and precision, in some experiments by several orders of magnitude.
    tmn at #SMM4H 2023: Comparing Text Preprocessing Techniques for Detecting Tweets Self-reporting a COVID-19 Diagnosis. (arXiv:2311.00732v1 [cs.CL])
    The paper describes a system developed for Task 1 at SMM4H 2023. The goal of the task is to automatically distinguish tweets that self-report a COVID-19 diagnosis (for example, a positive test, clinical diagnosis, or hospitalization) from those that do not. We investigate the use of different techniques for preprocessing tweets using four transformer-based models. The ensemble of fine-tuned language models obtained an F1-score of 84.5%, which is 4.1% higher than the average value.
    GIST: Generated Inputs Sets Transferability in Deep Learning. (arXiv:2311.00801v1 [cs.LG])
    As the demand for verifiability and testability of neural networks continues to rise, an increasing number of methods for generating test sets are being developed. However, each of these techniques tends to emphasize specific testing aspects and can be quite time-consuming. A straightforward solution to mitigate this issue is to transfer test sets between some benchmarked models and a new model under test, based on a desirable property one wishes to transfer. This paper introduces GIST (Generated Inputs Sets Transferability), a novel approach for the efficient transfer of test sets among Deep Learning models. Given a property of interest that a user wishes to transfer (e.g., coverage criterion), GIST enables the selection of good test sets from the point of view of this property among available ones from a benchmark. We empirically evaluate GIST on fault types coverage property with two modalities and different test set generation procedures to demonstrate the approach's feasibility. Experimental results show that GIST can select an effective test set for the given property to transfer it to the model under test. Our results suggest that GIST could be applied to transfer other properties and could generalize to different test sets' generation procedures and modalities.
    Time-series Generation by Contrastive Imitation. (arXiv:2311.01388v1 [stat.ML])
    Consider learning a generative model for time-series data. The sequential setting poses a unique challenge: Not only should the generator capture the conditional dynamics of (stepwise) transitions, but its open-loop rollouts should also preserve the joint distribution of (multi-step) trajectories. On one hand, autoregressive models trained by MLE allow learning and computing explicit transition distributions, but suffer from compounding error during rollouts. On the other hand, adversarial models based on GAN training alleviate such exposure bias, but transitions are implicit and hard to assess. In this work, we study a generative framework that seeks to combine the strengths of both: Motivated by a moment-matching objective to mitigate compounding error, we optimize a local (but forward-looking) transition policy, where the reinforcement signal is provided by a global (but stepwise-decomposable) energy model trained by contrastive estimation. At training, the two components are learned cooperatively, avoiding the instabilities typical of adversarial objectives. At inference, the learned policy serves as the generator for iterative sampling, and the learned energy serves as a trajectory-level measure for evaluating sample quality. By expressly training a policy to imitate sequential behavior of time-series features in a dataset, this approach embodies "generation by imitation". Theoretically, we illustrate the correctness of this formulation and the consistency of the algorithm. Empirically, we evaluate its ability to generate predictively useful samples from real-world datasets, verifying that it performs at the standard of existing benchmarks.
    Mahalanobis-Aware Training for Out-of-Distribution Detection. (arXiv:2311.00808v1 [cs.LG])
    While deep learning models have seen widespread success in controlled environments, there are still barriers to their adoption in open-world settings. One critical task for safe deployment is the detection of anomalous or out-of-distribution samples that may require human intervention. In this work, we present a novel loss function and recipe for training networks with improved density-based out-of-distribution sensitivity. We demonstrate the effectiveness of our method on CIFAR-10, notably reducing the false-positive rate of the relative Mahalanobis distance method on far-OOD tasks by over 50%.
    Electronic excited states from physically-constrained machine learning. (arXiv:2311.00844v1 [physics.chem-ph])
    Data-driven techniques are increasingly used to replace electronic-structure calculations of matter. In this context, a relevant question is whether machine learning (ML) should be applied directly to predict the desired properties or be combined explicitly with physically-grounded operations. We present an example of an integrated modeling approach, in which a symmetry-adapted ML model of an effective Hamiltonian is trained to reproduce electronic excitations from a quantum-mechanical calculation. The resulting model can make predictions for molecules that are much larger and more complex than those that it is trained on, and allows for dramatic computational savings by indirectly targeting the outputs of well-converged calculations while using a parameterization corresponding to a minimal atom-centered basis. These results emphasize the merits of intertwining data-driven techniques with physical approximations, improving the transferability and interpretability of ML models without affecting their accuracy and computational efficiency, and providing a blueprint for developing ML-augmented electronic-structure methods.
    Learning Collective Behaviors from Observation. (arXiv:2311.00875v1 [cs.LG])
    We present a review of a series of learning methods used to identify the structure of dynamical systems, aiming to understand emergent behaviors in complex systems of interacting agents. These methods not only offer theoretical guarantees of convergence but also demonstrate computational efficiency in handling high-dimensional observational data. They can manage observation data from both first- and second-order dynamical systems, accounting for observation/stochastic noise, complex interaction rules, missing interaction features, and real-world observations of interacting agent systems. The essence of developing such a series of learning methods lies in designing appropriate loss functions using the variational inverse problem approach, which inherently provides dimension reduction capabilities to our learning methods.
    Long-Range Neural Atom Learning for Molecular Graphs. (arXiv:2311.01276v1 [cs.LG])
    Graph Neural Networks (GNNs) have been widely adopted for drug discovery with molecular graphs. Nevertheless, current GNNs are mainly good at leveraging short-range interactions (SRI) but struggle to capture long-range interactions (LRI), both of which are crucial for determining molecular properties. To tackle this issue, we propose a method that implicitly projects all original atoms into a few Neural Atoms, which abstracts the collective information of atomic groups within a molecule. Specifically, we explicitly exchange the information among neural atoms and project them back to the atoms' representations as an enhancement. With this mechanism, neural atoms establish the communication channels among distant nodes, effectively reducing the interaction scope of arbitrary node pairs into a single hop. To provide an inspection of our method from a physical perspective, we reveal its connection with the traditional LRI calculation method, Ewald Summation. We conduct extensive experiments on three long-range graph benchmarks, covering both graph-level and link-level tasks on molecular graphs. We empirically justify that our method can be equipped with an arbitrary GNN and help to capture LRI.
    SmoothHess: ReLU Network Feature Interactions via Stein's Lemma. (arXiv:2311.00858v1 [cs.LG])
    Several recent methods for interpretability model feature interactions by looking at the Hessian of a neural network. This poses a challenge for ReLU networks, which are piecewise-linear and thus have a zero Hessian almost everywhere. We propose SmoothHess, a method of estimating second-order interactions through Stein's Lemma. In particular, we estimate the Hessian of the network convolved with a Gaussian through an efficient sampling algorithm, requiring only network gradient calls. SmoothHess is applied post-hoc, requires no modifications to the ReLU network architecture, and the extent of smoothing can be controlled explicitly. We provide a non-asymptotic bound on the sample complexity of our estimation procedure. We validate the superior ability of SmoothHess to capture interactions on benchmark datasets and a real-world medical spirometry dataset.
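The estimator described above rests on a standard consequence of Stein's Lemma: for isotropic Gaussian noise delta ~ N(0, sigma^2 I), the Hessian of the smoothed function g(x) = E[f(x + delta)] equals E[grad f(x + delta) delta^T] / sigma^2, so only gradient calls are needed. The sketch below is a plain Monte Carlo version of that identity on a toy two-unit ReLU network. It is not the authors' implementation; the function names, the sample count, and the final symmetrization step are assumptions made for illustration.

```python
import math
import random

def relu_abs_grad(x):
    """Gradient of f(x) = relu(x[0]) + relu(-x[0]), a tiny ReLU network equal to |x[0]|."""
    g0 = (1.0 if x[0] > 0 else 0.0) - (1.0 if x[0] < 0 else 0.0)
    return [g0, 0.0]

def smoothed_hessian(grad_fn, x, sigma=0.5, n_samples=20000, seed=0):
    """Monte Carlo estimate of the Hessian of E[f(x + delta)], delta ~ N(0, sigma^2 I)."""
    rng = random.Random(seed)
    d = len(x)
    acc = [[0.0] * d for _ in range(d)]
    for _ in range(n_samples):
        delta = [rng.gauss(0.0, sigma) for _ in range(d)]
        g = grad_fn([xi + di for xi, di in zip(x, delta)])
        # Accumulate the outer product grad f(x + delta) * delta^T.
        for i in range(d):
            for j in range(d):
                acc[i][j] += g[i] * delta[j]
    scale = 1.0 / (n_samples * sigma ** 2)
    est = [[v * scale for v in row] for row in acc]
    # The true Hessian is symmetric, so average the estimate with its transpose.
    return [[0.5 * (est[i][j] + est[j][i]) for j in range(d)] for i in range(d)]

H = smoothed_hessian(relu_abs_grad, [0.0, 0.0])

# Closed form for this toy case: d^2/dx^2 E|x + delta| at x = 0 is 2 / (sigma * sqrt(2 * pi)).
expected_h00 = 2.0 / (0.5 * math.sqrt(2.0 * math.pi))
```

The unsmoothed Hessian of this network is zero wherever it exists, while the smoothed estimate at the kink is clearly positive, which is the effect the abstract describes. A larger `sigma` spreads the curvature over a wider neighbourhood and lowers the value at the kink.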
    Optimizing Inventory Routing: A Decision-Focused Learning Approach using Neural Networks. (arXiv:2311.00983v1 [cs.LG])
    The Inventory Routing Problem (IRP) is a crucial challenge in supply chain management as it involves optimizing efficient route selection while considering the uncertainty of inventory demand planning. To solve IRPs, usually a two-stage approach is employed, where demand is predicted using machine learning techniques first, and then an optimization algorithm is used to minimize routing costs. Our experiment shows machine learning models fall short of achieving perfect accuracy because inventory levels are influenced by the dynamic business environment, which, in turn, affects the optimization problem in the next stage, resulting in sub-optimal decisions. In this paper, we formulate and propose a decision-focused learning-based approach to solving real-world IRPs. This approach directly integrates inventory prediction and routing optimization within an end-to-end system, potentially ensuring a robust supply chain strategy.
    Tailoring Mixup to Data using Kernel Warping functions. (arXiv:2311.01434v1 [cs.LG])
    Data augmentation is an essential building block for learning efficient deep learning models. Among all augmentation techniques proposed so far, linear interpolation of training data points, also called mixup, has been found to be effective for a large panel of applications. While the majority of works have focused on selecting the right points to mix, or applying complex non-linear interpolation, we are interested in mixing similar points more frequently and strongly than less similar ones. To this end, we propose to dynamically change the underlying distribution of interpolation coefficients through warping functions, depending on the similarity between data points to combine. We define an efficient and flexible framework to do so without losing in diversity. We provide extensive experiments for classification and regression tasks, showing that our proposed method improves both performance and calibration of models. Code is available at https://github.com/ENSTA-U2IS/torch-uncertainty
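As background for the abstract above, standard mixup can be sketched in a few lines: draw an interpolation coefficient from a symmetric Beta distribution and take the same convex combination of two inputs and of their labels. This is the baseline only. The similarity-dependent warping of the coefficient distribution that the paper proposes is not reproduced here, and the helper names and `alpha=0.4` are assumptions.

```python
import random

def sample_lambda(alpha, rng):
    """Draw the interpolation coefficient from a symmetric Beta(alpha, alpha)."""
    return rng.betavariate(alpha, alpha)

def mixup(x_i, y_i, x_j, y_j, lam):
    """Convex combination of two inputs and of their one-hot labels."""
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_i, x_j)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_i, y_j)]
    return x, y

rng = random.Random(0)
lam = sample_lambda(0.4, rng)
x_mix, y_mix = mixup([0.0, 0.0], [1.0, 0.0], [2.0, 4.0], [0.0, 1.0], lam)
```

With a small `alpha` most draws fall near 0 or 1, so the mixed sample stays close to one of the two originals; a warping scheme in the spirit of the abstract would change that distribution per pair depending on how similar the two points are.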
    Fitted Value Iteration Methods for Bicausal Optimal Transport. (arXiv:2306.12658v2 [stat.ML] UPDATED)
    We develop a fitted value iteration (FVI) method to compute bicausal optimal transport (OT) where couplings have an adapted structure. Based on the dynamic programming formulation, FVI adopts a function class to approximate the value functions in bicausal OT. Under the concentrability condition and approximate completeness assumption, we prove the sample complexity using (local) Rademacher complexity. Furthermore, we demonstrate that multilayer neural networks with appropriate structures satisfy the crucial assumptions required in sample complexity proofs. Numerical experiments reveal that FVI outperforms linear programming and adapted Sinkhorn methods in scalability as the time horizon increases, while still maintaining acceptable accuracy.
    A Finite-Particle Convergence Rate for Stein Variational Gradient Descent. (arXiv:2211.09721v5 [cs.LG] UPDATED)
    We provide the first finite-particle convergence rate for Stein variational gradient descent (SVGD), a popular algorithm for approximating a probability distribution with a collection of particles. Specifically, whenever the target distribution is sub-Gaussian with a Lipschitz score, SVGD with n particles and an appropriate step size sequence drives the kernel Stein discrepancy to zero at an order 1/sqrt(log log n) rate. We suspect that the dependence on n can be improved, and we hope that our explicit, non-asymptotic proof strategy will serve as a template for future refinements.
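    For readers unfamiliar with SVGD, the particle update it refers to can be sketched in one dimension. This is a minimal illustration under stated assumptions, not the setting analyzed in the paper: the RBF kernel with fixed bandwidth, the constant step size, the standard normal target (whose score is `-x`), and all names are choices made here for brevity.

```python
import math


def rbf_kernel(a, b, h):
    """RBF kernel k(a, b) = exp(-(a - b)^2 / (2 h^2))."""
    return math.exp(-((a - b) ** 2) / (2.0 * h * h))


def svgd_step(particles, score, step_size=0.1, bandwidth=1.0):
    """One SVGD update of all particles (1-D, fixed-bandwidth RBF kernel)."""
    n = len(particles)
    updated = []
    for xi in particles:
        phi = 0.0
        for xj in particles:
            k = rbf_kernel(xj, xi, bandwidth)
            # Attraction: kernel-weighted score pulls xi toward high density.
            phi += k * score(xj)
            # Repulsion: d/dxj k(xj, xi) = k * (xi - xj) / h^2 pushes xi away from xj.
            phi += k * (xi - xj) / (bandwidth * bandwidth)
        updated.append(xi + step_size * phi / n)
    return updated


def standard_normal_score(x):
    """Score (gradient of the log density) of N(0, 1)."""
    return -x


# Start all particles far from the target's mode and iterate.
particles = [4.0 + 0.1 * i for i in range(20)]
for _ in range(500):
    particles = svgd_step(particles, standard_normal_score)
mean = sum(particles) / len(particles)
```

    The attraction term moves the particle cloud toward the target, while the repulsion term keeps the particles from collapsing onto a single point.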
    Towards Safe Propofol Dosing during General Anesthesia Using Deep Offline Reinforcement Learning. (arXiv:2303.10180v2 [cs.LG] UPDATED)
    Automated anesthesia promises to enable more precise and personalized anesthetic administration and free anesthesiologists from repetitive tasks, allowing them to focus on the most critical aspects of a patient's surgical care. Current research has typically focused on creating simulated environments from which agents can learn. These approaches have demonstrated good experimental results, but are still far from clinical application. In this paper, Policy Constraint Q-Learning (PCQL), a data-driven reinforcement learning algorithm for solving the problem of learning anesthesia strategies on real clinical datasets, is proposed. Conservative Q-Learning was first introduced to alleviate the problem of Q function overestimation in an offline context. A policy constraint term is added to agent training to keep the policy distribution of the agent and the anesthesiologist consistent to ensure safer decisions made by the agent in anesthesia scenarios. The effectiveness of PCQL was validated by extensive experiments on a real clinical anesthesia dataset. Experimental results show that PCQL is predicted to achieve higher gains than the baseline approach while maintaining good agreement with the reference dose given by the anesthesiologist, using less total dose, and being more responsive to the patient's vital signs. In addition, the confidence intervals of the agent were investigated, which were able to cover most of the clinical decisions of the anesthesiologist. Finally, an interpretable method, SHAP, was used to analyze the contributing components of the model predictions to increase the transparency of the model.
    VIGraph: Self-supervised Learning for Class-Imbalanced Node Classification. (arXiv:2311.01191v1 [cs.LG])
    Class imbalance in graph data poses significant challenges for node classification. Existing methods, represented by SMOTE-based approaches, partially alleviate this issue but still exhibit limitations during imbalanced scenario construction. Self-supervised learning (SSL) offers a promising solution by synthesizing minority nodes from the data itself, yet its potential remains unexplored. In this paper, we analyze the limitations of SMOTE-based approaches and introduce VIGraph, a novel SSL model based on the self-supervised Variational Graph Auto-Encoder (VGAE) that leverages Variational Inference (VI) to generate minority nodes. Specifically, VIGraph strictly adheres to the concept of imbalance when constructing imbalanced graphs and utilizes the generative VGAE to generate minority nodes. Moreover, VIGraph introduces a novel Siamese contrastive strategy at the decoding phase to improve the overall quality of generated nodes. VIGraph can generate high-quality nodes without reintegrating them into the original graph, eliminating the "Generating, Reintegrating, and Retraining" process found in SMOTE-based methods. Experiments on multiple real-world datasets demonstrate that VIGraph achieves promising results for class-imbalanced node classification tasks.
    Entropy-based Discovery of Summary Causal Graphs in Time Series. (arXiv:2105.10381v2 [cs.AI] UPDATED)
    This study addresses the problem of learning a summary causal graph on time series with potentially different sampling rates. To do so, we first propose a new causal temporal mutual information measure for time series. We then show how this measure relates to an entropy reduction principle that can be seen as a special case of the probability raising principle. We finally combine these two ingredients in PC-like and FCI-like algorithms to construct the summary causal graph. These algorithms are evaluated on several datasets, which shows both their efficacy and efficiency.
    On the Lipschitz constant of random neural networks. (arXiv:2311.01356v1 [stat.ML])
    Empirical studies have widely demonstrated that neural networks are highly sensitive to small, adversarial perturbations of the input. The worst-case robustness against these so-called adversarial examples can be quantified by the Lipschitz constant of the neural network. However, only a few theoretical results regarding this quantity exist in the literature. In this paper, we initiate the study of the Lipschitz constant of random ReLU neural networks, i.e., neural networks whose weights are chosen at random and which employ the ReLU activation function. For shallow neural networks, we characterize the Lipschitz constant up to an absolute numerical constant. Moreover, we extend our analysis to deep neural networks of sufficiently large width where we prove upper and lower bounds for the Lipschitz constant. These bounds match up to a logarithmic factor that depends on the depth.
    Learning Defect Prediction from Unrealistic Data. (arXiv:2311.00931v1 [cs.LG])
    Pretrained models of code, such as CodeBERT and CodeT5, have become popular choices for code understanding and generation tasks. Such models tend to be large and require commensurate volumes of training data, which are rarely available for downstream tasks. Instead, it has become popular to train models with far larger but less realistic datasets, such as functions with artificially injected bugs. Models trained on such data, however, tend to only perform well on similar data, while underperforming on real-world programs. In this paper, we conjecture that this discrepancy stems from the presence of distracting samples that steer the model away from the real-world task distribution. To investigate this conjecture, we propose an approach for identifying the subsets of these large yet unrealistic datasets that are most similar to examples in real-world datasets based on their learned representations. Our approach extracts high-dimensional embeddings of both real-world and artificial programs using a neural model and scores artificial samples based on their distance to the nearest real-world sample. We show that training on only the nearest, representationally most similar samples while discarding samples that are not at all similar in representations yields consistent improvements across two popular pretrained models of code on two code understanding tasks. Our results are promising, in that they show that training models on a representative subset of an unrealistic dataset can help us harness the power of large-scale synthetic data generation while preserving downstream task performance. Finally, we highlight the limitations of applying AI models for predicting vulnerabilities and bugs in real-world applications.
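    The nearest-real-sample scoring described above can be sketched as follows. This is an illustrative reading of the abstract, not the authors' code: the Euclidean distance, the `keep_fraction` cutoff, and the function names are assumptions, and the embeddings are taken as plain lists of floats rather than outputs of a real code model.

```python
import math


def nearest_real_distance(sample, real_embeddings):
    """Distance from one artificial embedding to its closest real-world embedding."""
    return min(math.dist(sample, r) for r in real_embeddings)


def select_closest(artificial, real, keep_fraction=0.5):
    """Return indices of the artificial samples closest to any real-world sample."""
    order = sorted(
        range(len(artificial)),
        key=lambda i: nearest_real_distance(artificial[i], real),
    )
    k = max(1, int(len(artificial) * keep_fraction))  # always keep at least one
    return order[:k]


if __name__ == "__main__":
    real = [[0.0, 0.0], [10.0, 10.0]]
    artificial = [[0.0, 1.0], [5.0, 5.0], [10.0, 9.0], [20.0, 20.0]]
    print(select_closest(artificial, real, keep_fraction=0.5))
```

    The brute-force search costs one distance computation per artificial-real pair; a large dataset would likely need an approximate nearest-neighbor index instead.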
    Language Model Training Paradigms for Clinical Feature Embeddings. (arXiv:2311.00768v1 [cs.LG])
    In research areas with scarce data, representation learning plays a significant role. This work aims to enhance representation learning for clinical time series by deriving universal embeddings for clinical features, such as heart rate and blood pressure. We use self-supervised training paradigms for language models to learn high-quality clinical feature embeddings, achieving a finer granularity than existing time-step and patient-level representation learning. We visualize the learnt embeddings via unsupervised dimension reduction techniques and observe a high degree of consistency with prior clinical knowledge. We also evaluate the model performance on the MIMIC-III benchmark and demonstrate the effectiveness of using clinical feature embeddings. We publish our code online for replication.
    Generating QM1B with PySCF$_{\text{IPU}}$. (arXiv:2311.01135v1 [cs.LG])
    The emergence of foundation models in Computer Vision and Natural Language Processing has resulted in immense progress on downstream tasks. This progress was enabled by datasets with billions of training examples. Similar benefits are yet to be unlocked for quantum chemistry, where the potential of deep learning is constrained by comparatively small datasets with 100k to 20M training examples. These datasets are limited in size because the labels are computed using the accurate (but computationally demanding) predictions of Density Functional Theory (DFT). Notably, prior DFT datasets were created using CPU supercomputers without leveraging hardware acceleration. In this paper, we take a first step towards utilising hardware accelerators by introducing the data generator PySCF$_{\text{IPU}}$ using Intelligence Processing Units (IPUs). This allowed us to create the dataset QM1B with one billion training examples containing 9-11 heavy atoms. We demonstrate that a simple baseline neural network (SchNet 9M) improves its performance by simply increasing the amount of training data without additional inductive biases. To encourage future researchers to use QM1B responsibly, we highlight several limitations of QM1B and emphasise the low resolution of our DFT options, which also serves as motivation for even larger, more accurate datasets. Code and dataset are available on GitHub.
    Optimal Cost Constrained Adversarial Attacks For Multiple Agent Systems. (arXiv:2311.00859v1 [cs.LG])
    Finding optimal adversarial attack strategies is an important topic in reinforcement learning and the Markov decision process. Previous studies usually assume one all-knowing coordinator (attacker) for whom attacking different recipient (victim) agents incurs uniform costs. However, in reality, instead of using one limitless central attacker, the attacks often need to be performed by distributed attack agents. We formulate the problem of performing optimal adversarial agent-to-agent attacks using distributed attack agents, in which we impose distinct cost constraints on each different attacker-victim pair. We propose an optimal method integrating within-step static constrained attack-resource allocation optimization and between-step dynamic programming to achieve the optimal adversarial attack in a multi-agent system. Our numerical results show that the proposed attacks can significantly reduce the rewards received by the attacked agents.
    Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game. (arXiv:2311.01011v1 [cs.LG])
    While Large Language Models (LLMs) are increasingly being used in real-world applications, they remain vulnerable to prompt injection attacks: malicious third-party prompts that subvert the intent of the system designer. To help researchers study this problem, we present a dataset of over 126,000 prompt injection attacks and 46,000 prompt-based "defenses" against prompt injection, all created by players of an online game called Tensor Trust. To the best of our knowledge, this is currently the largest dataset of human-generated adversarial examples for instruction-following LLMs. The attacks in our dataset have a lot of easily interpretable structure, and shed light on the weaknesses of LLMs. We also use the dataset to create a benchmark for resistance to two types of prompt injection, which we refer to as prompt extraction and prompt hijacking. Our benchmark results show that many models are vulnerable to the attack strategies in the Tensor Trust dataset. Furthermore, we show that some attack strategies from the dataset generalize to deployed LLM-based applications, even though they have a very different set of constraints to the game. We release all data and source code at https://tensortrust.ai/paper
    Time-Independent Information-Theoretic Generalization Bounds for SGLD. (arXiv:2311.01046v1 [cs.LG])
    We provide novel information-theoretic generalization bounds for stochastic gradient Langevin dynamics (SGLD) under the assumptions of smoothness and dissipativity, which are widely used in sampling and non-convex optimization studies. Our bounds are time-independent and decay to zero as the sample size increases, regardless of the number of iterations and whether the step size is fixed. Unlike previous studies, we derive the generalization error bounds by focusing on the time evolution of the Kullback--Leibler divergence, which is related to the stability of datasets and is the upper bound of the mutual information between output parameters and an input dataset. Additionally, we establish the first information-theoretic generalization bound when the training and test loss are the same by showing that a loss function of SGLD is sub-exponential. This bound is also time-independent and removes the problematic step size dependence in existing work, leading to an improved excess risk bound by combining our analysis with the existing non-convex optimization error bounds.
    SCPO: Safe Reinforcement Learning with Safety Critic Policy Optimization. (arXiv:2311.00880v1 [cs.LG])
    Incorporating safety is an essential prerequisite for broadening the practical applications of reinforcement learning in real-world scenarios. To tackle this challenge, Constrained Markov Decision Processes (CMDPs) are leveraged, which introduce a distinct cost function representing safety violations. In the CMDP setting, the Lagrangian relaxation technique has been employed in previous algorithms to convert constrained optimization problems into unconstrained dual problems. However, these algorithms may inaccurately predict unsafe behavior, resulting in instability while learning the Lagrange multiplier. This study introduces a novel safe reinforcement learning algorithm, Safety Critic Policy Optimization (SCPO). In this study, we define the safety critic, a mechanism that nullifies rewards obtained through violating safety constraints. Furthermore, our theoretical analysis indicates that the proposed algorithm can automatically balance the trade-off between adhering to safety constraints and maximizing rewards. The effectiveness of the SCPO algorithm is empirically validated by benchmarking it against strong baselines.
    When Do Graph Neural Networks Help with Node Classification? Investigating the Impact of Homophily Principle on Node Distinguishability. (arXiv:2304.14274v3 [cs.SI] UPDATED)
    The homophily principle, i.e., that nodes with the same labels are more likely to be connected, has been believed to be the main reason for the performance superiority of Graph Neural Networks (GNNs) over Neural Networks on node classification tasks. Recent research suggests that, even in the absence of homophily, the advantage of GNNs still exists as long as nodes from the same class share similar neighborhood patterns. However, this argument only considers intra-class Node Distinguishability (ND) but neglects inter-class ND, which provides an incomplete understanding of homophily on GNNs. In this paper, we first demonstrate such deficiency with examples and argue that an ideal situation for ND is to have smaller intra-class ND than inter-class ND. To formulate this idea and study ND deeply, we propose Contextual Stochastic Block Model for Homophily (CSBM-H) and define two metrics, Probabilistic Bayes Error (PBE) and negative generalized Jeffreys divergence, to quantify ND. With the metrics, we visualize and analyze how graph filters, node degree distributions and class variances influence ND, and investigate the combined effect of intra- and inter-class ND. Besides, we discovered the mid-homophily pitfall, which occurs widely in graph datasets. Furthermore, we verified that, in real-world tasks, the superiority of GNNs is indeed closely related to both intra- and inter-class ND regardless of homophily levels. Grounded in this observation, we propose a new hypothesis-testing based performance metric beyond homophily, which is non-linear, feature-based and can provide a statistical threshold value for the superiority of GNNs. Experiments indicate that it is significantly more effective than the existing homophily metrics at revealing the advantage and disadvantage of graph-aware models on both synthetic and benchmark real-world datasets.
    Distilling Out-of-Distribution Robustness from Vision-Language Foundation Models. (arXiv:2311.01441v1 [cs.LG])
    We propose a conceptually simple and lightweight framework for improving the robustness of vision models through the combination of knowledge distillation and data augmentation. We address the conjecture that larger models do not make for better teachers by showing strong gains in out-of-distribution robustness when distilling from pretrained foundation models. Following this finding, we propose Discrete Adversarial Distillation (DAD), which leverages a robust teacher to generate adversarial examples and a VQGAN to discretize them, creating more informative samples than standard data augmentation techniques. We provide a theoretical framework for the use of a robust teacher in the knowledge distillation with data augmentation setting and demonstrate strong gains in out-of-distribution robustness and clean accuracy across different student architectures. Notably, our method adds minor computational overhead compared to similar techniques and can be easily combined with other data augmentations for further improvements.
    Attacking Graph Neural Networks with Bit Flips: Weisfeiler and Lehman Go Indifferent. (arXiv:2311.01205v1 [cs.LG])
    Prior attacks on graph neural networks have mostly focused on graph poisoning and evasion, neglecting the network's weights and biases. Traditional weight-based fault injection attacks, such as bit flip attacks used for convolutional neural networks, do not consider the unique properties of graph neural networks. We propose the Injectivity Bit Flip Attack, the first bit flip attack designed specifically for graph neural networks. Our attack targets the learnable neighborhood aggregation functions in quantized message passing neural networks, degrading their ability to distinguish graph structures and losing the expressivity of the Weisfeiler-Lehman test. Our findings suggest that exploiting mathematical properties specific to certain graph neural network architectures can significantly increase their vulnerability to bit flip attacks. Injectivity Bit Flip Attacks can degrade the maximal expressive Graph Isomorphism Networks trained on various graph property prediction datasets to random output by flipping only a small fraction of the network's bits, demonstrating its higher destructive power compared to a bit flip attack transferred from convolutional neural networks. Our attack is transparent and motivated by theoretical insights which are confirmed by extensive empirical results.
    Multi-Operational Mathematical Derivations in Latent Space. (arXiv:2311.01230v1 [cs.LG])
    This paper investigates the possibility of approximating multiple mathematical operations in latent space for expression derivation. To this end, we introduce different multi-operational representation paradigms, modelling mathematical operations as explicit geometric transformations. By leveraging a symbolic engine, we construct a large-scale dataset comprising 1.7M derivation steps stemming from 61K premises and 6 operators, analysing the properties of each paradigm when instantiated with state-of-the-art neural encoders. Specifically, we investigate how different encoding mechanisms can approximate equational reasoning in latent space, exploring the trade-off between learning different operators and specialising within single operations, as well as the ability to support multi-step derivations and out-of-distribution generalisation. Our empirical analysis reveals that the multi-operational paradigm is crucial for disentangling different operators, while discriminating the conclusions for a single operation is achievable in the original expression encoder. Moreover, we show that architectural choices can heavily affect the training dynamics, structural organisation, and generalisation of the latent space, resulting in significant variations across paradigms and classes of encoders.
    A quantum-classical performance separation in nonconvex optimization. (arXiv:2311.00811v1 [quant-ph])
    In this paper, we identify a family of nonconvex continuous optimization instances, each $d$-dimensional instance with $2^d$ local minima, to demonstrate a quantum-classical performance separation. Specifically, we prove that the recently proposed Quantum Hamiltonian Descent (QHD) algorithm [Leng et al., arXiv:2303.01471] is able to solve any $d$-dimensional instance from this family using $\widetilde{\mathcal{O}}(d^3)$ quantum queries to the function value and $\widetilde{\mathcal{O}}(d^4)$ additional 1-qubit and 2-qubit elementary quantum gates. On the other side, a comprehensive empirical study suggests that representative state-of-the-art classical optimization algorithms/solvers (including Gurobi) would require a super-polynomial time to solve such optimization instances.
    Multimodal Foundation Models for Zero-shot Animal Species Recognition in Camera Trap Images. (arXiv:2311.01064v1 [cs.CV])
    Due to deteriorating environmental conditions and increasing human activity, conservation efforts directed towards wildlife are crucial. Motion-activated camera traps constitute an efficient tool for tracking and monitoring wildlife populations across the globe. Supervised learning techniques have been successfully deployed to analyze such imagery; however, training such techniques requires annotations from experts. Reducing the reliance on costly labelled data therefore has immense potential in developing large-scale wildlife tracking solutions with markedly less human labor. In this work we propose WildMatch, a novel zero-shot species classification framework that leverages multimodal foundation models. In particular, we instruction tune vision-language models to generate detailed visual descriptions of camera trap images using similar terminology to experts. Then, we match the generated caption to an external knowledge base of descriptions in order to determine the species in a zero-shot manner. We investigate techniques to build instruction tuning datasets for detailed animal description generation and propose a novel knowledge augmentation technique to enhance caption quality. We demonstrate the performance of WildMatch on a new camera trap dataset collected in the Magdalena Medio region of Colombia.
    Zero Coordinate Shift: Whetted Automatic Differentiation for Physics-informed Operator Learning. (arXiv:2311.00860v1 [cs.LG])
    Automatic differentiation (AD) is a critical step in physics-informed machine learning, required for computing the high-order derivatives of network output w.r.t. coordinates. In this paper, we present a novel and lightweight algorithm to conduct such AD for physics-informed operator learning, a trick we call Zero Coordinate Shift (ZCS). Instead of making all sampled coordinates leaf variables, ZCS introduces only one scalar-valued leaf variable for each spatial or temporal dimension, leading to a game-changing performance leap by simplifying the wanted derivatives from "many-roots-many-leaves" to "one-root-many-leaves". ZCS is easy to implement with current deep learning libraries; our own implementation is by extending the DeepXDE package. We carry out a comprehensive benchmark analysis and several case studies, training physics-informed DeepONets to solve partial differential equations (PDEs) without data. The results show that ZCS has persistently brought down GPU memory consumption and wall time for training by an order of magnitude, with the savings increasing with problem scale (i.e., number of functions, number of points and order of PDE). As a low-level optimisation, ZCS entails no restrictions on data, physics (PDEs) or network architecture and does not compromise training results from any aspect.
    Harnessing machine learning for accurate treatment of overlapping opacity species in GCMs. (arXiv:2311.00775v1 [astro-ph.EP])
    To understand high-precision observations of exoplanets and brown dwarfs, we need detailed and complex general circulation models (GCMs) that incorporate hydrodynamics, chemistry, and radiation. In this study, we specifically examine the coupling between chemistry and radiation in GCMs and compare different methods for mixing opacities of different chemical species in the correlated-k assumption, when equilibrium chemistry cannot be assumed. We propose a fast machine learning method based on DeepSets (DS), which effectively combines individual correlated-k opacities (k-tables). We evaluate the DS method alongside other published methods like adaptive equivalent extinction (AEE) and random overlap with rebinning and resorting (RORR). We integrate these mixing methods into our GCM (expeRT/MITgcm) and assess their accuracy and performance for the example of the hot Jupiter HD 209458 b. Our findings indicate that the DS method is both accurate and efficient for GCM usage, whereas RORR is too slow. Additionally, we observe that the accuracy of AEE depends on its specific implementation and may introduce numerical issues in achieving radiative transfer solution convergence. We then apply the DS mixing method in a simplified chemical disequilibrium situation, where we model the rainout of TiO and VO, and confirm that the rainout of TiO and VO would hinder the formation of a stratosphere. To further expedite the development of consistent disequilibrium chemistry calculations in GCMs, we provide documentation and code for coupling the DS mixing method with correlated-k radiative transfer solvers. The DS method has been extensively tested to be accurate enough for GCMs; however, other methods might be needed for accelerating atmospheric retrievals.
    Non-Autoregressive Diffusion-based Temporal Point Processes for Continuous-Time Long-Term Event Prediction. (arXiv:2311.01033v1 [cs.LG])
    Continuous-time long-term event prediction plays an important role in many application scenarios. Most existing works rely on autoregressive frameworks to predict event sequences, which suffer from error accumulation, thus compromising prediction quality. Inspired by the success of denoising diffusion probabilistic models, we propose a diffusion-based non-autoregressive temporal point process model for long-term event prediction in continuous time. Instead of generating events one at a time in an autoregressive way, our model predicts the future event sequence entirely as a whole. In order to perform diffusion processes on event sequences, we develop a bidirectional map between target event sequences and the Euclidean vector space. Furthermore, we design a novel denoising network to capture both sequential and contextual features for better sample quality. Extensive experiments are conducted to prove the superiority of our proposed model over state-of-the-art methods on long-term event prediction in continuous time. To the best of our knowledge, this is the first work to apply diffusion methods to long-term event prediction problems.
    Scalable Counterfactual Distribution Estimation in Multivariate Causal Models. (arXiv:2311.00927v1 [stat.ML])
    We consider the problem of estimating the counterfactual joint distribution of multiple quantities of interests (e.g., outcomes) in a multivariate causal model extended from the classical difference-in-difference design. Existing methods for this task either ignore the correlation structures among dimensions of the multivariate outcome by considering univariate causal models on each dimension separately and hence produce incorrect counterfactual distributions, or poorly scale even for moderate-size datasets when directly dealing with such multivariate causal model. We propose a method that alleviates both issues simultaneously by leveraging a robust latent one-dimensional subspace of the original high-dimension space and exploiting the efficient estimation from the univariate causal model on such space. Since the construction of the one-dimensional subspace uses information from all the dimensions, our method can capture the correlation structures and produce good estimates of the counterfactual distribution. We demonstrate the advantages of our approach over existing methods on both synthetic and real-world data.
    Federated Linear Bandits with Finite Adversarial Actions. (arXiv:2311.00973v1 [cs.LG])
    We study a federated linear bandits model, where $M$ clients communicate with a central server to solve a linear contextual bandits problem with finite adversarial action sets that may be different across clients. To address the unique challenges of adversarial finite action sets, we propose the FedSupLinUCB algorithm, which extends the principles of SupLinUCB and OFUL algorithms in linear contextual bandits. We prove that FedSupLinUCB achieves a total regret of $\tilde{O}(\sqrt{d T})$, where $T$ is the total number of arm pulls from all clients, and $d$ is the ambient dimension of the linear model. This matches the minimax lower bound and thus is order-optimal (up to polylog terms). We study both asynchronous and synchronous cases and show that the communication cost can be controlled as $O(d M^2 \log(d)\log(T))$ and $O(\sqrt{d^3 M^3} \log(d))$, respectively. The FedSupLinUCB design is further extended to two scenarios: (1) variance-adaptive, where a total regret of $\tilde{O} (\sqrt{d \sum \nolimits_{t=1}^{T} \sigma_t^2})$ can be achieved with $\sigma_t^2$ being the noise variance of round $t$; and (2) adversarial corruption, where a total regret of $\tilde{O}(\sqrt{dT} + d C_p)$ can be achieved with $C_p$ being the total corruption budget. Experiment results corroborate the theoretical analysis and demonstrate the effectiveness of FedSupLinUCB on both synthetic and real-world datasets.
    Anonymous Learning via Look-Alike Clustering: A Precise Analysis of Model Generalization. (arXiv:2310.04015v3 [cs.LG] UPDATED)
    While personalized recommendation systems have become increasingly popular, ensuring user data protection remains a top concern in the development of these learning systems. A common approach to enhancing privacy involves training models using anonymous data rather than individual data. In this paper, we explore a natural technique called look-alike clustering, which involves replacing sensitive features of individuals with the cluster's average values. We provide a precise analysis of how training models using anonymous cluster centers affects their generalization capabilities. We focus on an asymptotic regime where the size of the training set grows in proportion to the feature dimension. Our analysis is based on the Convex Gaussian Minimax Theorem (CGMT) and allows us to theoretically understand the role of different model components on the generalization error. In addition, we demonstrate that in certain high-dimensional regimes, training over anonymous cluster centers acts as a regularization and improves generalization error of the trained models. Finally, we corroborate our asymptotic theory with finite-sample numerical experiments where we observe a perfect match when the sample size is only on the order of a few hundred.
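    The replacement step of look-alike clustering, as described above, can be sketched in a few lines. The sketch assumes the cluster assignment is already given (how clusters are formed is outside its scope), and the function and argument names are hypothetical.

```python
from collections import defaultdict


def look_alike_anonymize(rows, cluster_ids, sensitive_cols):
    """Replace each row's sensitive features with the mean over its cluster."""
    sums = defaultdict(lambda: [0.0] * len(sensitive_cols))
    counts = defaultdict(int)
    # First pass: accumulate per-cluster sums of the sensitive columns.
    for row, cid in zip(rows, cluster_ids):
        counts[cid] += 1
        for k, col in enumerate(sensitive_cols):
            sums[cid][k] += row[col]
    # Second pass: copy each row and overwrite sensitive columns with cluster means.
    anonymized = []
    for row, cid in zip(rows, cluster_ids):
        new_row = list(row)
        for k, col in enumerate(sensitive_cols):
            new_row[col] = sums[cid][k] / counts[cid]
        anonymized.append(new_row)
    return anonymized


if __name__ == "__main__":
    rows = [[1.0, 10.0], [3.0, 20.0], [5.0, 30.0]]
    print(look_alike_anonymize(rows, cluster_ids=[0, 0, 1], sensitive_cols=[0]))
```

    Non-sensitive columns are left untouched, so a downstream model still sees individual-level values for them.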
    PPI++: Efficient Prediction-Powered Inference. (arXiv:2311.01453v1 [stat.ML])
    We present PPI++: a computationally lightweight methodology for estimation and inference based on a small labeled dataset and a typically much larger dataset of machine-learning predictions. The methods automatically adapt to the quality of available predictions, yielding easy-to-compute confidence sets -- for parameters of any dimensionality -- that always improve on classical intervals using only the labeled data. PPI++ builds on prediction-powered inference (PPI), which targets the same problem setting, improving its computational and statistical efficiency. Real and synthetic experiments demonstrate the benefits of the proposed adaptations.
    Additive Decoders for Latent Variables Identification and Cartesian-Product Extrapolation. (arXiv:2307.02598v2 [cs.LG] UPDATED)
    We tackle the problems of latent variables identification and ``out-of-support'' image generation in representation learning. We show that both are possible for a class of decoders that we call additive, which are reminiscent of decoders used for object-centric representation learning (OCRL) and well suited for images that can be decomposed as a sum of object-specific images. We provide conditions under which exactly solving the reconstruction problem using an additive decoder is guaranteed to identify the blocks of latent variables up to permutation and block-wise invertible transformations. This guarantee relies only on very weak assumptions about the distribution of the latent factors, which might present statistical dependencies and have an almost arbitrarily shaped support. Our result provides a new setting where nonlinear independent component analysis (ICA) is possible and adds to our theoretical understanding of OCRL methods. We also show theoretically that additive decoders can generate novel images by recombining observed factors of variation in novel ways, an ability we refer to as Cartesian-product extrapolation. We show empirically that additivity is crucial for both identifiability and extrapolation on simulated data.
    The Universal Statistical Structure and Scaling Laws of Chaos and Turbulence. (arXiv:2311.01358v1 [cond-mat.stat-mech])
    Turbulence is a complex spatial and temporal structure created by the strong non-linear dynamics of fluid flows at high Reynolds numbers. Despite being a ubiquitous phenomenon that has been studied for centuries, a full understanding of turbulence has remained a formidable challenge. Here, we introduce tools from the fields of quantum chaos and Random Matrix Theory (RMT) and present a detailed analysis of image datasets generated from turbulence simulations of incompressible and compressible fluid flows. Focusing on two observables: the data Gram matrix and the single image distribution, we study both the local and global eigenvalue statistics and compare them to classical chaos, uncorrelated noise and natural images. We show that from the RMT perspective, the turbulence Gram matrices lie in the same universality class as quantum chaotic rather than integrable systems, and the data exhibits power-law scalings in the bulk of its eigenvalues which are vastly different from uncorrelated classical chaos, random data, and natural images. Interestingly, we find that the single sample distribution only appears as fully RMT chaotic, but deviates from chaos at larger correlation lengths, as well as exhibiting different scaling properties.
    Neural Diffusion Models. (arXiv:2310.08337v1 [cs.LG] CROSS LISTED)
    Diffusion models have shown remarkable performance on many generative tasks. Despite recent success, most diffusion models are restricted in that they only allow linear transformation of the data distribution. In contrast, a broader family of transformations can potentially help train generative distributions more efficiently, simplifying the reverse process and closing the gap between the true negative log-likelihood and the variational approximation. In this paper, we present Neural Diffusion Models (NDMs), a generalization of conventional diffusion models that enables defining and learning time-dependent non-linear transformations of data. We show how to optimise NDMs using a variational bound in a simulation-free setting. Moreover, we derive a time-continuous formulation of NDMs, which allows fast and reliable inference using off-the-shelf numerical ODE and SDE solvers. Finally, we demonstrate the utility of NDMs with learnable transformations through experiments on standard image generation benchmarks, including CIFAR-10, downsampled versions of ImageNet and CelebA-HQ. NDMs outperform conventional diffusion models in terms of likelihood and produce high-quality samples.
    Score-based Data Assimilation for a Two-Layer Quasi-Geostrophic Model. (arXiv:2310.01853v2 [stat.ML] UPDATED)
    Data assimilation addresses the problem of identifying plausible state trajectories of dynamical systems given noisy or incomplete observations. In geosciences, it presents challenges due to the high-dimensionality of geophysical dynamical systems, often exceeding millions of dimensions. This work assesses the scalability of score-based data assimilation (SDA), a novel data assimilation method, in the context of such systems. We propose modifications to the score network architecture aimed at significantly reducing memory consumption and execution time. We demonstrate promising results for a two-layer quasi-geostrophic model.
    Analysis of tidal flows through the Strait of Gibraltar using Dynamic Mode Decomposition. (arXiv:2311.01377v1 [math.DS])
    The Strait of Gibraltar is a region characterized by intricate oceanic sub-mesoscale features, influenced by topography, tidal forces, instabilities, and nonlinear hydraulic processes, all governed by the nonlinear equations of fluid motion. In this study, we aim to uncover the underlying physics of these phenomena within 3D MIT general circulation model simulations, including waves, eddies, and gyres. To achieve this, we employ Dynamic Mode Decomposition (DMD) to break down simulation snapshots into Koopman modes, with distinct exponential growth/decay rates and oscillation frequencies. Our objectives encompass evaluating DMD's efficacy in capturing known features, unveiling new elements, ranking modes, and exploring order reduction. We also introduce modifications to enhance DMD's robustness, numerical accuracy, and robustness of eigenvalues. DMD analysis yields a comprehensive understanding of flow patterns, internal wave formation, and the dynamics of the Strait of Gibraltar, its meandering behaviors, and the formation of a secondary gyre, notably the Western Alboran Gyre, as well as the propagation of Kelvin and coastal-trapped waves along the African coast. In doing so, it significantly advances our comprehension of intricate oceanographic phenomena and underscores the immense utility of DMD as an analytical tool for such complex datasets, suggesting that DMD could serve as a valuable addition to the toolkit of oceanographers.
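    For readers unfamiliar with DMD, the decomposition of snapshots into modes with growth/decay rates and oscillation frequencies can be sketched as follows. This is the standard SVD-based ("exact") DMD applied to a toy linear system, not the modified variant or the ocean-model data used in the paper; the function name `dmd` and the test system are illustrative assumptions.

```python
import numpy as np

def dmd(snapshots, rank):
    """Standard exact DMD; snapshots has shape (n_features, n_times), uniformly sampled."""
    X, Y = snapshots[:, :-1], snapshots[:, 1:]
    U, s, Vh = np.linalg.svd(X, full_matrices=False)
    U, s, Vh = U[:, :rank], s[:rank], Vh[:rank]
    V = Vh.conj().T
    # Reduced operator U^H Y V S^{-1}: the best-fit linear map Y ~ A X,
    # projected onto the leading left singular vectors of X.
    A_tilde = U.conj().T @ Y @ V / s
    eigvals, W = np.linalg.eig(A_tilde)
    modes = (Y @ V / s) @ W
    return eigvals, modes

# Toy system: a decaying rotation, whose eigenvalues are 0.9 * exp(+/- 0.3i).
theta = 0.3
A = 0.9 * np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
states = [np.array([1.0, 0.0])]
for _ in range(19):
    states.append(A @ states[-1])
snapshots = np.column_stack(states)

eigvals, modes = dmd(snapshots, rank=2)
decay_per_step = np.abs(eigvals)      # |lambda| < 1 means the mode decays
phase_per_step = np.angle(eigvals)    # oscillation phase advance per snapshot
```

    Dividing `np.log(decay_per_step)` and `phase_per_step` by the sampling interval gives continuous-time growth rates and angular frequencies.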
    Gaussian Processes on Cellular Complexes. (arXiv:2311.01198v1 [cs.LG])
    In recent years, there has been considerable interest in developing machine learning models on graphs in order to account for topological inductive biases. In particular, recent attention was given to Gaussian processes on such structures since they can additionally account for uncertainty. However, graphs are limited to modelling relations between two vertices. In this paper, we go beyond this dyadic setting and consider polyadic relations that include interactions between vertices, edges and one of their generalisations, known as cells. Specifically, we propose Gaussian processes on cellular complexes, a generalisation of graphs that captures interactions between these higher-order cells. One of our key contributions is the derivation of two novel kernels, one that generalises the graph Mat\'ern kernel and one that additionally mixes information of different cell types.
    Time-Independent Information-Theoretic Generalization Bounds for SGLD. (arXiv:2311.01046v1 [cs.LG])
    We provide novel information-theoretic generalization bounds for stochastic gradient Langevin dynamics (SGLD) under the assumptions of smoothness and dissipativity, which are widely used in sampling and non-convex optimization studies. Our bounds are time-independent and decay to zero as the sample size increases, regardless of the number of iterations and whether the step size is fixed. Unlike previous studies, we derive the generalization error bounds by focusing on the time evolution of the Kullback--Leibler divergence, which is related to the stability of datasets and is the upper bound of the mutual information between output parameters and an input dataset. Additionally, we establish the first information-theoretic generalization bound when the training and test loss are the same by showing that a loss function of SGLD is sub-exponential. This bound is also time-independent and removes the problematic step size dependence in existing work, leading to an improved excess risk bound by combining our analysis with the existing non-convex optimization error bounds.
    Deep Transformed Gaussian Processes. (arXiv:2310.18230v2 [cs.LG] UPDATED)
    Transformed Gaussian Processes (TGPs) are stochastic processes specified by transforming samples from the joint distribution of a prior process (typically a GP) using an invertible transformation, which increases the flexibility of the base process. Furthermore, they achieve competitive results compared with Deep Gaussian Processes (DGPs), which are another generalization constructed by a hierarchical concatenation of GPs. In this work, we propose a generalization of TGPs named Deep Transformed Gaussian Processes (DTGPs), which follows the trend of concatenating layers of stochastic processes. More precisely, we obtain a multi-layer model in which each layer is a TGP. This generalization implies an increment of flexibility with respect to both TGPs and DGPs. Exact inference in such a model is intractable. However, we show that one can use variational inference to approximate the required computations, yielding a straightforward extension of the popular DSVI inference algorithm (Salimbeni et al., 2017). The experiments conducted evaluate the proposed novel DTGPs on multiple regression datasets, achieving good scalability and performance.
    Invariant-Feature Subspace Recovery: A New Class of Provable Domain Generalization Algorithms. (arXiv:2311.00966v1 [cs.LG])
    Domain generalization asks for models trained over a set of training environments to generalize well in unseen test environments. Recently, a series of algorithms such as Invariant Risk Minimization (IRM) have been proposed for domain generalization. However, Rosenfeld et al. (2021) show that in a simple linear data model, even if non-convexity issues are ignored, IRM and its extensions cannot generalize to unseen environments with fewer than $d_s+1$ training environments, where $d_s$ is the dimension of the spurious-feature subspace. In this work, we propose Invariant-feature Subspace Recovery (ISR): a new class of algorithms to achieve provable domain generalization across the settings of classification and regression problems. First, in the binary classification setup of Rosenfeld et al. (2021), we show that our first algorithm, ISR-Mean, can identify the subspace spanned by invariant features from the first-order moments of the class-conditional distributions, and achieve provable domain generalization with $d_s+1$ training environments. Our second algorithm, ISR-Cov, further reduces the required number of training environments to $O(1)$ using the information of second-order moments. Notably, unlike IRM, our algorithms bypass non-convexity issues and enjoy global convergence guarantees. Next, we extend ISR-Mean to the more general setting of multi-class classification and propose ISR-Multiclass, which leverages class information and provably recovers the invariant-feature subspace with $\lceil d_s/k\rceil+1$ training environments for $k$-class classification. Finally, for regression problems, we propose ISR-Regression that can identify the invariant-feature subspace with $d_s+1$ training environments. Empirically, we demonstrate the superior performance of our ISRs on synthetic benchmarks. Further, ISR can be used as a post-processing method for feature extractors such as neural nets.
    Bridging Machine Learning and Sciences: Opportunities and Challenges. (arXiv:2210.13441v2 [stat.ML] UPDATED)
    The application of machine learning in the sciences has seen exciting advances in recent years. As a widely applicable technique, anomaly detection has long been studied in the machine learning community. In particular, deep neural network-based out-of-distribution detection has made great progress for high-dimensional data. Recently, these techniques have been showing their potential in scientific disciplines. We take a critical look at their applicative prospects including data universality, experimental protocols, model robustness, etc. We discuss examples that display transferable practices and domain-specific challenges simultaneously, providing a starting point for establishing a novel interdisciplinary research paradigm in the near future.
    Long Story Short: Omitted Variable Bias in Causal Machine Learning. (arXiv:2112.13398v4 [econ.EM] UPDATED)
    We derive general, yet simple, sharp bounds on the size of the omitted variable bias for a broad class of causal parameters that can be identified as linear functionals of the conditional expectation function of the outcome. Such functionals encompass many of the traditional targets of investigation in causal inference studies, such as, for example, (weighted) average of potential outcomes, average treatment effects (including subgroup effects, such as the effect on the treated), (weighted) average derivatives, and policy effects from shifts in covariate distribution -- all for general, nonparametric causal models. Our construction relies on the Riesz-Frechet representation of the target functional. Specifically, we show how the bound on the bias depends only on the additional variation that the latent variables create both in the outcome and in the Riesz representer for the parameter of interest. Moreover, in many important cases (e.g., average treatment effects and average derivatives) the bound is shown to depend on easily interpretable quantities that measure the explanatory power of the omitted variables. Therefore, simple plausibility judgments on the maximum explanatory power of omitted variables (in explaining treatment and outcome variation) are sufficient to place overall bounds on the size of the bias. Furthermore, we use debiased machine learning to provide flexible and efficient statistical inference on learnable components of the bounds. Finally, empirical examples demonstrate the usefulness of the approach.
    Add and Thin: Diffusion for Temporal Point Processes. (arXiv:2311.01139v1 [cs.LG])
    Autoregressive neural networks within the temporal point process (TPP) framework have become the standard for modeling continuous-time event data. Even though these models can expressively capture event sequences in a one-step-ahead fashion, they are inherently limited for long-term forecasting applications due to the accumulation of errors caused by their sequential nature. To overcome these limitations, we derive ADD-THIN, a principled probabilistic denoising diffusion model for TPPs that operates on entire event sequences. Unlike existing diffusion approaches, ADD-THIN naturally handles data with discrete and continuous components. In experiments on synthetic and real-world datasets, our model matches the state-of-the-art TPP models in density estimation and strongly outperforms them in forecasting.
    Fitted Value Iteration Methods for Bicausal Optimal Transport. (arXiv:2306.12658v2 [stat.ML] UPDATED)
    We develop a fitted value iteration (FVI) method to compute bicausal optimal transport (OT) where couplings have an adapted structure. Based on the dynamic programming formulation, FVI adopts a function class to approximate the value functions in bicausal OT. Under the concentrability condition and approximate completeness assumption, we prove the sample complexity using (local) Rademacher complexity. Furthermore, we demonstrate that multilayer neural networks with appropriate structures satisfy the crucial assumptions required in sample complexity proofs. Numerical experiments reveal that FVI outperforms linear programming and adapted Sinkhorn methods in scalability as the time horizon increases, while still maintaining acceptable accuracy.
    Dyadic Reinforcement Learning. (arXiv:2308.07843v5 [cs.LG] UPDATED)
    Mobile health aims to enhance health outcomes by delivering interventions to individuals as they go about their daily life. The involvement of care partners and social support networks often proves crucial in helping individuals manage burdensome medical conditions. This presents opportunities in mobile health to design interventions that target the dyadic relationship -- the relationship between a target person and their care partner -- with the aim of enhancing social support. In this paper, we develop dyadic RL, an online reinforcement learning algorithm designed to personalize intervention delivery based on contextual factors and past responses of a target person and their care partner. Here, multiple sets of interventions impact the dyad across multiple time intervals. The developed dyadic RL is Bayesian and hierarchical. We formally introduce the problem setup, develop dyadic RL and establish a regret bound. We demonstrate dyadic RL's empirical performance through simulation studies on both toy scenarios and on a realistic test bed constructed from data collected in a mobile health study.
    Tailoring Mixup to Data using Kernel Warping functions. (arXiv:2311.01434v1 [cs.LG])
    Data augmentation is an essential building block for learning efficient deep learning models. Among all augmentation techniques proposed so far, linear interpolation of training data points, also called mixup, has been found to be effective for a wide range of applications. While the majority of works have focused on selecting the right points to mix, or applying complex non-linear interpolation, we are interested in mixing similar points more frequently and strongly than less similar ones. To this end, we propose to dynamically change the underlying distribution of interpolation coefficients through warping functions, depending on the similarity between data points to combine. We define an efficient and flexible framework to do so without losing in diversity. We provide extensive experiments for classification and regression tasks, showing that our proposed method improves both performance and calibration of models. Code is available at https://github.com/ENSTA-U2IS/torch-uncertainty
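    The idea of tying the interpolation coefficient to pair similarity can be sketched roughly as below. This is not the paper's warping function or the torch-uncertainty API: the Gaussian similarity, the Beta(tau, tau) sampling rule, and the parameters `tau_max` and `length_scale` are all assumptions made for illustration.

```python
import numpy as np

def similarity_mixup(x, y, rng, tau_max=1.0, length_scale=1.0):
    """Mix each row with a random partner; similar pairs are mixed more strongly.

    tau_max and length_scale are hypothetical knobs, not taken from the paper.
    """
    perm = rng.permutation(len(x))
    dist2 = np.sum((x - x[perm]) ** 2, axis=1)
    similarity = np.exp(-dist2 / (2.0 * length_scale**2))  # Gaussian kernel, in [0, 1]
    # Beta(tau, tau) with small tau concentrates near 0 or 1 (weak mixing);
    # larger tau spreads mass toward 0.5 (strong mixing).
    tau = np.maximum(tau_max * similarity, 0.05)
    lam = rng.beta(tau, tau)[:, None]
    x_mix = lam * x + (1.0 - lam) * x[perm]
    y_mix = lam * y + (1.0 - lam) * y[perm]
    return x_mix, y_mix, lam

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 3))
y = np.eye(2)[rng.integers(0, 2, size=8)]   # one-hot labels
x_mix, y_mix, lam = similarity_mixup(x, y, rng)
```

    Plain mixup corresponds to drawing `lam` from one fixed Beta distribution for all pairs; the only change in this sketch is that the Beta parameter varies with the distance between the two points.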
    Gaussian Process Priors for Systems of Linear Partial Differential Equations with Constant Coefficients. (arXiv:2212.14319v4 [stat.ML] UPDATED)
    Partial differential equations (PDEs) are important tools to model physical systems and incorporating them into machine learning models is an important way of encoding physical knowledge. Given any system of linear PDEs with constant coefficients, we propose a family of Gaussian process (GP) priors, which we call EPGP, such that all realizations are exact solutions of this system. We apply the Ehrenpreis-Palamodov fundamental principle, which works as a non-linear Fourier transform, to construct GP kernels mirroring standard spectral methods for GPs. Our approach can infer probable solutions of linear PDE systems from any data such as noisy measurements, or pointwise defined initial and boundary conditions. Constructing EPGP-priors is algorithmic, generally applicable, and comes with a sparse version (S-EPGP) that learns the relevant spectral frequencies and works better for big data sets. We demonstrate our approach on three families of systems of PDEs, the heat equation, wave equation, and Maxwell's equations, where we improve upon the state of the art in computation time and precision, in some experiments by several orders of magnitude.
    Inversion of Bayesian Networks. (arXiv:2212.10649v2 [cs.LG] UPDATED)
    Variational autoencoders and Helmholtz machines use a recognition network (encoder) to approximate the posterior distribution of a generative model (decoder). In this paper we study the necessary and sufficient properties of a recognition network so that it can model the true posterior distribution exactly. These results are derived in the general context of probabilistic graphical modelling / Bayesian networks, for which the network represents a set of conditional independence statements. We derive both global conditions, in terms of d-separation, and local conditions for the recognition network to have the desired qualities. It turns out that for the local conditions the property perfectness (for every node, all parents are joined) plays an important role.
    A Coreset-based, Tempered Variational Posterior for Accurate and Scalable Stochastic Gaussian Process Inference. (arXiv:2311.01409v1 [cs.LG])
    We present a novel stochastic variational Gaussian process ($\mathcal{GP}$) inference method, based on a posterior over a learnable set of weighted pseudo input-output points (coresets). Instead of a free-form variational family, the proposed coreset-based, variational tempered family for $\mathcal{GP}$s (CVTGP) is defined in terms of the $\mathcal{GP}$ prior and the data-likelihood; hence, accommodating the modeling inductive biases. We derive CVTGP's lower bound for the log-marginal likelihood via marginalization of the proposed posterior over latent $\mathcal{GP}$ coreset variables, and show it is amenable to stochastic optimization. CVTGP reduces the learnable parameter size to $\mathcal{O}(M)$, enjoys numerical stability, and maintains $\mathcal{O}(M^3)$ time- and $\mathcal{O}(M^2)$ space-complexity, by leveraging a coreset-based tempered posterior that, in turn, provides sparse and explainable representations of the data. Results on simulated and real-world regression problems with Gaussian observation noise validate that CVTGP provides better evidence lower-bound estimates and predictive root mean squared error than alternative stochastic $\mathcal{GP}$ inference methods.
    On the Lipschitz constant of random neural networks. (arXiv:2311.01356v1 [stat.ML])
    Empirical studies have widely demonstrated that neural networks are highly sensitive to small, adversarial perturbations of the input. The worst-case robustness against these so-called adversarial examples can be quantified by the Lipschitz constant of the neural network. However, only a few theoretical results regarding this quantity exist in the literature. In this paper, we initiate the study of the Lipschitz constant of random ReLU neural networks, i.e., neural networks whose weights are chosen at random and which employ the ReLU activation function. For shallow neural networks, we characterize the Lipschitz constant up to an absolute numerical constant. Moreover, we extend our analysis to deep neural networks of sufficiently large width where we prove upper and lower bounds for the Lipschitz constant. These bounds match up to a logarithmic factor that depends on the depth.
    Unreading Race: Purging Protected Features from Chest X-ray Embeddings. (arXiv:2311.01349v1 [cs.LG])
    Purpose: To analyze and remove protected feature effects in chest radiograph embeddings of deep learning models. Materials and Methods: An orthogonalization is utilized to remove the influence of protected features (e.g., age, sex, race) in chest radiograph embeddings, ensuring feature-independent results. To validate the efficacy of the approach, we retrospectively study the MIMIC and CheXpert datasets using three pre-trained models, namely a supervised contrastive, a self-supervised contrastive, and a baseline classifier model. Our statistical analysis involves comparing the original versus the orthogonalized embeddings by estimating protected feature influences and evaluating the ability to predict race, age, or sex using the two types of embeddings. Results: Our experiments reveal a significant influence of protected features on predictions of pathologies. Applying orthogonalization removes these feature effects. Apart from removing any influence on pathology classification, while maintaining competitive predictive performance, orthogonalized embeddings further make it infeasible to directly predict protected attributes and mitigate subgroup disparities. Conclusion: The presented work demonstrates the successful application and evaluation of the orthogonalization technique in the domain of chest X-ray classification.
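    One common way to realize such an orthogonalization is to project the embeddings onto the orthogonal complement of the protected features. The sketch below assumes exactly that (a linear least-squares residualization with an intercept); the abstract does not spell out the paper's precise procedure, and the function name and synthetic data are hypothetical.

```python
import numpy as np

def orthogonalize(embeddings, protected):
    """Remove the linear effect of protected features from each embedding dimension."""
    n = embeddings.shape[0]
    Z = np.column_stack([np.ones(n), protected])      # intercept + protected features
    coef, *_ = np.linalg.lstsq(Z, embeddings, rcond=None)
    # Residuals of the least-squares fit are orthogonal to every column of Z.
    return embeddings - Z @ coef

rng = np.random.default_rng(0)
protected = rng.normal(size=(200, 2))                 # stand-ins for e.g. age and sex
embeddings = protected @ rng.normal(size=(2, 5)) + rng.normal(size=(200, 5))
E_orth = orthogonalize(embeddings, protected)
```

    Note that this removes only linear dependence on the protected features, and that including the intercept also centers each embedding dimension.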
    A Finite-Particle Convergence Rate for Stein Variational Gradient Descent. (arXiv:2211.09721v5 [cs.LG] UPDATED)
    We provide the first finite-particle convergence rate for Stein variational gradient descent (SVGD), a popular algorithm for approximating a probability distribution with a collection of particles. Specifically, whenever the target distribution is sub-Gaussian with a Lipschitz score, SVGD with n particles and an appropriate step size sequence drives the kernel Stein discrepancy to zero at an order 1/sqrt(log log n) rate. We suspect that the dependence on n can be improved, and we hope that our explicit, non-asymptotic proof strategy will serve as a template for future refinements.
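    As background, a single SVGD update with an RBF kernel can be sketched as follows. The fixed bandwidth, constant step size, and standard Gaussian target are illustrative assumptions; they are not the step size sequence or the setting analyzed in the paper.

```python
import numpy as np

def svgd_step(particles, score, step_size, bandwidth):
    """One SVGD update with the RBF kernel k(x, y) = exp(-||x - y||^2 / bandwidth)."""
    diffs = particles[:, None, :] - particles[None, :, :]    # diffs[i, j] = x_i - x_j
    K = np.exp(-np.sum(diffs**2, axis=-1) / bandwidth)       # K[i, j] = k(x_i, x_j)
    drive = K @ score(particles)                             # sum_j k(x_j, x_i) * score(x_j)
    # sum_j grad_{x_j} k(x_j, x_i): a repulsive term that keeps particles apart.
    repulse = (2.0 / bandwidth) * np.sum(K[:, :, None] * diffs, axis=1)
    return particles + step_size * (drive + repulse) / len(particles)

def gaussian_score(x):
    """Score (gradient of the log density) of a standard Gaussian target."""
    return -x

rng = np.random.default_rng(0)
particles = rng.normal(loc=5.0, scale=1.0, size=(50, 1))     # start far from the target
for _ in range(500):
    particles = svgd_step(particles, gaussian_score, step_size=0.1, bandwidth=1.0)
```

    The first term pulls particles toward high-density regions of the target, while the second prevents them from collapsing onto a single point.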
    Kernel-based Joint Independence Tests for Multivariate Stationary and Non-stationary Time Series. (arXiv:2305.08529v3 [stat.ME] UPDATED)
    Multivariate time series data that capture the temporal evolution of interconnected systems are ubiquitous in diverse areas. Understanding the complex relationships and potential dependencies among co-observed variables is crucial for the accurate statistical modelling and analysis of such systems. Here, we introduce kernel-based statistical tests of joint independence in multivariate time series by extending the $d$-variable Hilbert-Schmidt independence criterion (dHSIC) to encompass both stationary and non-stationary processes, thus allowing broader real-world applications. By leveraging resampling techniques tailored for both single- and multiple-realisation time series, we show how the method robustly uncovers significant higher-order dependencies in synthetic examples, including frequency mixing data and logic gates, as well as real-world climate, neuroscience, and socioeconomic data. Our method adds to the mathematical toolbox for the analysis of multivariate time series and can aid in uncovering high-order interactions in data.
    On Learning Gaussian Multi-index Models with Gradient Flow. (arXiv:2310.19793v2 [stat.ML] UPDATED)
    We study gradient flow on the multi-index regression problem for high-dimensional Gaussian data. Multi-index functions consist of a composition of an unknown low-rank linear projection and an arbitrary unknown, low-dimensional link function. As such, they constitute a natural template for feature learning in neural networks. We consider a two-timescale algorithm, whereby the low-dimensional link function is learnt with a non-parametric model infinitely faster than the subspace parametrizing the low-rank projection. By appropriately exploiting the matrix semigroup structure arising over the subspace correlation matrices, we establish global convergence of the resulting Grassmannian population gradient flow dynamics, and provide a quantitative description of its associated `saddle-to-saddle' dynamics. Notably, the timescales associated with each saddle can be explicitly characterized in terms of an appropriate Hermite decomposition of the target link function. In contrast with these positive results, we also show that the related \emph{planted} problem, where the link function is known and fixed, in fact has a rough optimization landscape, in which gradient flow dynamics might get trapped with high probability.
    Towards Characterizing the First-order Query Complexity of Learning (Approximate) Nash Equilibria in Zero-sum Matrix Games. (arXiv:2304.12768v2 [cs.GT] UPDATED)
    In the first-order query model for zero-sum $K\times K$ matrix games, players observe the expected pay-offs for all their possible actions under the randomized action played by their opponent. This classical model has received renewed interest after the discovery by Rakhlin and Sridharan that $\epsilon$-approximate Nash equilibria can be computed efficiently from $O(\frac{\ln K}{\epsilon})$ instead of $O(\frac{\ln K}{\epsilon^2})$ queries. Surprisingly, the optimal number of such queries, as a function of both $\epsilon$ and $K$, is not known. We make progress on this question on two fronts. First, we fully characterise the query complexity of learning exact equilibria ($\epsilon=0$), by showing that they require a number of queries that is linear in $K$, which means that it is essentially as hard as querying the whole matrix, which can also be done with $K$ queries. Second, for $\epsilon > 0$, the current query complexity upper bound stands at $O(\min(\frac{\ln(K)}{\epsilon} , K))$. We argue that, unfortunately, obtaining a matching lower bound is not possible with existing techniques: we prove that no lower bound can be derived by constructing hard matrices whose entries take values in a known countable set, because such matrices can be fully identified by a single query. This rules out, for instance, reducing to an optimization problem over the hypercube by encoding it as a binary payoff matrix. We then introduce a new technique for lower bounds, which allows us to obtain lower bounds of order $\tilde\Omega(\log(\frac{1}{K\epsilon}))$ for any $\epsilon \leq 1 / (cK^4)$, where $c$ is a constant independent of $K$. We further discuss possible future directions to improve on our techniques in order to close the gap with the upper bounds.
    Generalized Bayesian Inference for Scientific Simulators via Amortized Cost Estimation. (arXiv:2305.15208v2 [stat.ML] UPDATED)
    Simulation-based inference (SBI) enables amortized Bayesian inference for simulators with implicit likelihoods. But when we are primarily interested in the quality of predictive simulations, or when the model cannot exactly reproduce the observed data (i.e., is misspecified), targeting the Bayesian posterior may be overly restrictive. Generalized Bayesian Inference (GBI) aims to robustify inference for (misspecified) simulator models, replacing the likelihood-function with a cost function that evaluates the goodness of parameters relative to data. However, GBI methods generally require running multiple simulations to estimate the cost function at each parameter value during inference, making the approach computationally infeasible for even moderately complex simulators. Here, we propose amortized cost estimation (ACE) for GBI to address this challenge: We train a neural network to approximate the cost function, which we define as the expected distance between simulations produced by a parameter and observed data. The trained network can then be used with MCMC to infer GBI posteriors for any observation without running additional simulations. We show that, on several benchmark tasks, ACE accurately predicts cost and provides predictive simulations that are closer to synthetic observations than other SBI methods, especially for misspecified simulators. Finally, we apply ACE to infer parameters of the Hodgkin-Huxley model given real intracellular recordings from the Allen Cell Types Database. ACE identifies better data-matching parameters while being an order of magnitude more simulation-efficient than a standard SBI method. In summary, ACE combines the strengths of SBI methods and GBI to perform robust and simulation-amortized inference for scientific simulators.
    Exclusive Group Lasso for Structured Variable Selection. (arXiv:2108.10284v2 [cs.LG] UPDATED)
    A structured variable selection problem is considered in which the covariates, divided into predefined groups, activate according to sparse patterns with few nonzero entries per group. Capitalizing on the concept of atomic norm, a composite norm can be properly designed to promote such exclusive group sparsity patterns. The resulting norm lends itself to efficient and flexible regularized optimization algorithms for support recovery, like the proximal algorithm. Moreover, an active set algorithm is proposed that builds the solution by successively including structure atoms into the estimated support. It is also shown that such an algorithm can be tailored to match more rigid structures than plain exclusive group sparsity. Asymptotic consistency analysis (with both the number of parameters as well as the number of groups growing with the observation size) establishes the effectiveness of the proposed solution in terms of signed support recovery under conventional assumptions. Finally, a set of numerical simulations further corroborates the results.
    Targeted Separation and Convergence with Kernel Discrepancies. (arXiv:2209.12835v2 [stat.ML] UPDATED)
    Maximum mean discrepancies (MMDs) like the kernel Stein discrepancy (KSD) have grown central to a wide range of applications, including hypothesis testing, sampler selection, distribution approximation, and variational inference. In each setting, these kernel-based discrepancy measures are required to (i) separate a target P from other probability measures or even (ii) control weak convergence to P. In this article we derive new sufficient and necessary conditions to ensure (i) and (ii). For MMDs on separable metric spaces, we characterize those kernels that separate Bochner embeddable measures and introduce simple conditions for separating all measures with unbounded kernels and for controlling convergence with bounded kernels. We use these results on $\mathbb{R}^d$ to substantially broaden the known conditions for KSD separation and convergence control and to develop the first KSDs known to exactly metrize weak convergence to P. Along the way, we highlight the implications of our results for hypothesis testing, measuring and improving sample quality, and sampling with Stein variational gradient descent.  ( 2 min )
    The Behavior and Convergence of Local Bayesian Optimization. (arXiv:2305.15572v2 [cs.LG] UPDATED)
    A recent development in Bayesian optimization is the use of local optimization strategies, which can deliver strong empirical performance on high-dimensional problems compared to traditional global strategies. The "folk wisdom" in the literature is that the focus on local optimization sidesteps the curse of dimensionality; however, little is known concretely about the expected behavior or convergence of Bayesian local optimization routines. We first study the behavior of the local approach, and find that the statistics of individual local solutions of Gaussian process sample paths are surprisingly good compared to what we would expect to recover from global methods. We then present the first rigorous analysis of such a Bayesian local optimization algorithm recently proposed by M\"uller et al. (2021), and derive convergence rates in both the noisy and noiseless settings.  ( 2 min )
    Computable Phenotypes of Patient Acuity in the Intensive Care Unit. (arXiv:2005.05163v2 [q-bio.QM] UPDATED)
    Continuous monitoring and patient acuity assessments are key aspects of Intensive Care Unit (ICU) practice, but both are limited by time constraints imposed on healthcare providers. Moreover, anticipating clinical trajectories remains imprecise. The objectives of this study are to (1) develop an electronic phenotype of acuity using automated variable retrieval within the electronic health records and (2) describe transitions between acuity states that illustrate the clinical trajectories of ICU patients. We gathered two single-center, longitudinal electronic health record datasets for 51,372 adult ICU patients admitted to the University of Florida Health (UFH) Gainesville (GNV) and Jacksonville (JAX). We developed algorithms to quantify acuity status at four-hour intervals for each ICU admission and identify acuity phenotypes using continuous acuity status and k-means clustering approach. 51,073 admissions for 38,749 patients in the UFH GNV dataset and 22,219 admissions for 12,623 patients in the UFH JAX dataset had at least one ICU stay lasting more than four hours. There were three phenotypes: persistently stable, persistently unstable, and transitioning from unstable to stable. For stable patients, approximately 0.7%-1.7% would transition to unstable, 0.02%-0.1% would expire, 1.2%-3.4% would be discharged, and the remaining 96%-97% would remain stable in the ICU every four hours. For unstable patients, approximately 6%-10% would transition to stable, 0.4%-0.5% would expire, and the remaining 89%-93% would remain unstable in the ICU in the next four hours. We developed phenotyping algorithms for patient acuity status every four hours while admitted to the ICU. This approach may be useful in developing prognostic and clinical decision-support tools to aid patients, caregivers, and providers in shared decision-making processes regarding escalation of care and patient values.  ( 3 min )
    Amortized Simulation-Based Frequentist Inference for Tractable and Intractable Likelihoods. (arXiv:2306.07769v2 [stat.ME] UPDATED)
    High-fidelity simulators that connect theoretical models with observations are indispensable tools in many sciences. When coupled with machine learning, a simulator makes it possible to infer the parameters of a theoretical model directly from real and simulated observations without explicit use of the likelihood function. This is of particular interest when the latter is intractable. In this work, we introduce a simple extension of the recently proposed likelihood-free frequentist inference (LF2I) approach that has some computational advantages. Like LF2I, this extension yields provably valid confidence sets in parameter inference problems in which a high-fidelity simulator is available. The utility of our algorithm is illustrated by applying it to three pedagogically interesting examples: the first is from cosmology, the second from high-energy physics and astronomy, both with tractable likelihoods, while the third, with an intractable likelihood, is from epidemiology.  ( 2 min )
    Discrepancy Modeling Framework: Learning missing physics, modeling systematic residuals, and disambiguating between deterministic and random effects. (arXiv:2203.05164v2 [stat.ML] UPDATED)
    Physics-based and first-principles models pervade the engineering and physical sciences, allowing for the ability to model the dynamics of complex systems with a prescribed accuracy. The approximations used in deriving governing equations often result in discrepancies between the model and sensor-based measurements of the system, revealing the approximate nature of the equations and/or the signal-to-noise ratio of the sensor itself. In modern dynamical systems, such discrepancies between model and measurement can lead to poor quantification, often undermining the ability to produce accurate and precise control algorithms. We introduce a discrepancy modeling framework to identify the missing physics and resolve the model-measurement mismatch with two distinct approaches: (i) by learning a model for the evolution of systematic state-space residual, and (ii) by discovering a model for the deterministic dynamical error. Regardless of approach, a common suite of data-driven model discovery methods can be used. The choice of method depends on one's intent (e.g., mechanistic interpretability) for discrepancy modeling, sensor measurement characteristics (e.g., quantity, quality, resolution), and constraints imposed by practical applications (e.g., modeling approaches using the suite of data-driven modeling methods on three continuous dynamical systems under varying signal-to-noise ratios. Finally, we emphasize structural shortcomings of each discrepancy modeling approach depending on error type. In summary, if the true dynamics are unknown (i.e., an imperfect model), one should learn a discrepancy model of the missing physics in the dynamical space. Yet, if the true dynamics are known yet model-measurement mismatch still exists, one should learn a discrepancy model in the state space.  ( 3 min )
    Sequence Modeling with Multiresolution Convolutional Memory. (arXiv:2305.01638v2 [cs.LG] UPDATED)
    Efficiently capturing the long-range patterns in sequential data sources salient to a given task -- such as classification and generative modeling -- poses a fundamental challenge. Popular approaches in the space trade off between the memory burden of brute-force enumeration and comparison, as in transformers, the computational burden of complicated sequential dependencies, as in recurrent neural networks, or the parameter burden of convolutional networks with many or large filters. We instead take inspiration from wavelet-based multiresolution analysis to define a new building block for sequence modeling, which we call a MultiresLayer. The key component of our model is the multiresolution convolution, capturing multiscale trends in the input sequence. Our MultiresConv can be implemented with shared filters across a dilated causal convolution tree. Thus it garners the computational advantages of convolutional networks and the principled theoretical motivation of wavelet decompositions. Our MultiresLayer is straightforward to implement, requires significantly fewer parameters, and maintains at most an $\mathcal{O}(N\log N)$ memory footprint for a length $N$ sequence. Yet, by stacking such layers, our model yields state-of-the-art performance on a number of sequence classification and autoregressive density estimation tasks using CIFAR-10, ListOps, and PTB-XL datasets.  ( 2 min )
    Conformal Prediction for Time Series with Modern Hopfield Networks. (arXiv:2303.12783v2 [cs.LG] UPDATED)
    To quantify uncertainty, conformal prediction methods are gaining continuously more interest and have already been successfully applied to various domains. However, they are difficult to apply to time series as the autocorrelative structure of time series violates basic assumptions required by conformal prediction. We propose HopCPT, a novel conformal prediction approach for time series that not only copes with temporal structures but leverages them. We show that our approach is theoretically well justified for time series where temporal dependencies are present. In experiments, we demonstrate that our new approach outperforms state-of-the-art conformal prediction methods on multiple real-world time series datasets from four different domains.  ( 2 min )
    Manifold-augmented Eikonal Equations: Geodesic Distances and Flows on Differentiable Manifolds. (arXiv:2310.06157v2 [cs.CG] UPDATED)
    Manifolds discovered by machine learning models provide a compact representation of the underlying data. Geodesics on these manifolds define locally length-minimising curves and provide a notion of distance, which are key for reduced-order modelling, statistical inference, and interpolation. In this work, we propose a model-based parameterisation for distance fields and geodesic flows on manifolds, exploiting solutions of a manifold-augmented Eikonal equation. We demonstrate how the geometry of the manifold impacts the distance field, and exploit the geodesic flow to obtain globally length-minimising curves directly. This work opens opportunities for statistics and reduced-order modelling on differentiable manifolds.  ( 2 min )
    Sample-efficient Multi-objective Molecular Optimization with GFlowNets. (arXiv:2302.04040v2 [cs.LG] UPDATED)
    Many crucial scientific problems involve designing novel molecules with desired properties, which can be formulated as a black-box optimization problem over the discrete chemical space. In practice, multiple conflicting objectives and costly evaluations (e.g., wet-lab experiments) make the diversity of candidates paramount. Computational methods have achieved initial success but still struggle with considering diversity in both objective and search space. To fill this gap, we propose a multi-objective Bayesian optimization (MOBO) algorithm leveraging the hypernetwork-based GFlowNets (HN-GFN) as an acquisition function optimizer, with the purpose of sampling a diverse batch of candidate molecular graphs from an approximate Pareto front. Using a single preference-conditioned hypernetwork, HN-GFN learns to explore various trade-offs between objectives. We further propose a hindsight-like off-policy strategy to share high-performing molecules among different preferences in order to speed up learning for HN-GFN. We empirically illustrate that HN-GFN has adequate capacity to generalize over preferences. Moreover, experiments in various real-world MOBO settings demonstrate that our framework predominantly outperforms existing methods in terms of candidate quality and sample efficiency. The code is available at https://github.com/violet-sto/HN-GFN.  ( 2 min )
    Lossy Image Compression with Conditional Diffusion Models. (arXiv:2209.06950v6 [eess.IV] UPDATED)
    This paper outlines an end-to-end optimized lossy image compression framework using diffusion generative models. The approach relies on the transform coding paradigm, where an image is mapped into a latent space for entropy coding and, from there, mapped back to the data space for reconstruction. In contrast to VAE-based neural compression, where the (mean) decoder is a deterministic neural network, our decoder is a conditional diffusion model. Our approach thus introduces an additional "content" latent variable on which the reverse diffusion process is conditioned and uses this variable to store information about the image. The remaining "texture" variables characterizing the diffusion process are synthesized at decoding time. We show that the model's performance can be tuned toward perceptual metrics of interest. Our extensive experiments involving multiple datasets and image quality assessment metrics show that our approach yields stronger reported FID scores than the GAN-based model, while also yielding competitive performance with VAE-based models in several distortion metrics. Furthermore, training the diffusion with X-parameterization enables high-quality reconstructions in only a handful of decoding steps, greatly affecting the model's practicality.  ( 2 min )
    Pretraining Data Mixtures Enable Narrow Model Selection Capabilities in Transformer Models. (arXiv:2311.00871v1 [cs.LG])
    Transformer models, notably large language models (LLMs), have the remarkable ability to perform in-context learning (ICL) -- to perform new tasks when prompted with unseen input-output examples without any explicit model training. In this work, we study how effectively transformers can bridge between their pretraining data mixture, comprised of multiple distinct task families, to identify and learn new tasks in-context which are both inside and outside the pretraining distribution. Building on previous work, we investigate this question in a controlled setting, where we study transformer models trained on sequences of $(x, f(x))$ pairs rather than natural language. Our empirical results show transformers demonstrate near-optimal unsupervised model selection capabilities, in their ability to first in-context identify different task families and in-context learn within them when the task families are well-represented in their pretraining data. However, when presented with tasks or functions which are out-of-domain of their pretraining data, we demonstrate various failure modes of transformers and degradation of their generalization for even simple extrapolation tasks. Together, our results highlight that the impressive ICL abilities of high-capacity sequence models may be more closely tied to the coverage of their pretraining data mixtures than inductive biases that create fundamental generalization capabilities.  ( 2 min )
    Contrastive Moments: Unsupervised Halfspace Learning in Polynomial Time. (arXiv:2311.01435v1 [cs.LG])
    We give a polynomial-time algorithm for learning high-dimensional halfspaces with margins in $d$-dimensional space to within desired TV distance when the ambient distribution is an unknown affine transformation of the $d$-fold product of an (unknown) symmetric one-dimensional logconcave distribution, and the halfspace is introduced by deleting at least an $\epsilon$ fraction of the data in one of the component distributions. Notably, our algorithm does not need labels and establishes the unique (and efficient) identifiability of the hidden halfspace under this distributional assumption. The sample and time complexity of the algorithm are polynomial in the dimension and $1/\epsilon$. The algorithm uses only the first two moments of suitable re-weightings of the empirical distribution, which we call contrastive moments; its analysis uses classical facts about generalized Dirichlet polynomials and relies crucially on a new monotonicity property of the moment ratio of truncations of logconcave distributions. Such algorithms, based only on first and second moments were suggested in earlier work, but hitherto eluded rigorous guarantees. Prior work addressed the special case when the underlying distribution is Gaussian via Non-Gaussian Component Analysis. We improve on this by providing polytime guarantees based on Total Variation (TV) distance, in place of existing moment-bound guarantees that can be super-polynomial. Our work is also the first to go beyond Gaussians in this setting.  ( 2 min )
    Generalizing Nonlinear ICA Beyond Structural Sparsity. (arXiv:2311.00866v1 [cs.LG])
    Nonlinear independent component analysis (ICA) aims to uncover the true latent sources from their observable nonlinear mixtures. Despite its significance, the identifiability of nonlinear ICA is known to be impossible without additional assumptions. Recent advances have proposed conditions on the connective structure from sources to observed variables, known as Structural Sparsity, to achieve identifiability in an unsupervised manner. However, the sparsity constraint may not hold universally for all sources in practice. Furthermore, the assumptions of bijectivity of the mixing process and independence among all sources, which arise from the setting of ICA, may also be violated in many real-world scenarios. To address these limitations and generalize nonlinear ICA, we propose a set of new identifiability results in the general settings of undercompleteness, partial sparsity and source dependence, and flexible grouping structures. Specifically, we prove identifiability when there are more observed variables than sources (undercomplete), and when certain sparsity and/or source independence assumptions are not met for some changing sources. Moreover, we show that even in cases with flexible grouping structures (e.g., part of the sources can be divided into irreducible independent groups with various sizes), appropriate identifiability results can also be established. Theoretical claims are supported empirically on both synthetic and real-world datasets.  ( 2 min )
    Stochastic Smoothed Gradient Descent Ascent for Federated Minimax Optimization. (arXiv:2311.00944v1 [stat.ML])
    In recent years, federated minimax optimization has attracted growing interest due to its extensive applications in various machine learning tasks. While Smoothed Alternative Gradient Descent Ascent (Smoothed-AGDA) has proved its success in centralized nonconvex minimax optimization, whether and how the smoothing technique could be helpful in the federated setting remains unexplored. In this paper, we propose a new algorithm termed Federated Stochastic Smoothed Gradient Descent Ascent (FESS-GDA), which utilizes the smoothing technique for federated minimax optimization. We prove that FESS-GDA can be uniformly used to solve several classes of federated minimax problems and prove new or better analytical convergence results for these settings. We showcase the practical efficiency of FESS-GDA on the federated learning tasks of training generative adversarial networks (GANs) and fair classification.  ( 2 min )
    Scalable Counterfactual Distribution Estimation in Multivariate Causal Models. (arXiv:2311.00927v1 [stat.ML])
    We consider the problem of estimating the counterfactual joint distribution of multiple quantities of interest (e.g., outcomes) in a multivariate causal model extended from the classical difference-in-difference design. Existing methods for this task either ignore the correlation structures among dimensions of the multivariate outcome by considering univariate causal models on each dimension separately, and hence produce incorrect counterfactual distributions, or scale poorly even for moderate-size datasets when dealing directly with such a multivariate causal model. We propose a method that alleviates both issues simultaneously by leveraging a robust latent one-dimensional subspace of the original high-dimensional space and exploiting the efficient estimation from the univariate causal model on such a space. Since the construction of the one-dimensional subspace uses information from all the dimensions, our method can capture the correlation structures and produce good estimates of the counterfactual distribution. We demonstrate the advantages of our approach over existing methods on both synthetic and real-world data.  ( 2 min )
    Resilient Multiple Choice Learning: A learned scoring scheme with application to audio scene analysis. (arXiv:2311.01052v1 [stat.ML])
    We introduce Resilient Multiple Choice Learning (rMCL), an extension of the MCL approach for conditional distribution estimation in regression settings where multiple targets may be sampled for each training input. Multiple Choice Learning is a simple framework to tackle multimodal density estimation, using the Winner-Takes-All (WTA) loss for a set of hypotheses. In regression settings, the existing MCL variants focus on merging the hypotheses, thereby eventually sacrificing the diversity of the predictions. In contrast, our method relies on a novel learned scoring scheme underpinned by a mathematical framework based on Voronoi tessellations of the output space, from which we can derive a probabilistic interpretation. After empirically validating rMCL with experiments on synthetic data, we further assess its merits on the sound source localization problem, demonstrating its practical usefulness and the relevance of its interpretation.  ( 2 min )
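    As a rough illustration of the Winner-Takes-All loss mentioned in this abstract, the sketch below charges each target only to its closest hypothesis, which is the same as assigning each target to the Voronoi cell of a hypothesis. It uses squared Euclidean distance and plain Python lists as assumptions, and it does not implement rMCL's learned scoring scheme.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two points given as sequences."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def winner_index(hypotheses, target):
    """Index of the closest hypothesis, i.e. the Voronoi cell containing target."""
    return min(range(len(hypotheses)),
               key=lambda k: squared_distance(hypotheses[k], target))


def wta_loss(hypotheses, targets):
    """Mean over targets of the squared distance to the winning hypothesis."""
    total = 0.0
    for target in targets:
        k = winner_index(hypotheses, target)
        total += squared_distance(hypotheses[k], target)
    return total / len(targets)


# Hypothetical toy data: two hypotheses and two targets for one input.
hypotheses = [(0.0, 0.0), (4.0, 0.0)]
targets = [(1.0, 0.0), (3.0, 0.0)]
loss = wta_loss(hypotheses, targets)
winners = [winner_index(hypotheses, t) for t in targets]
```

    Because only the winning hypothesis is penalised for each target, different hypotheses are free to specialise on different modes of the target distribution.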
    High-dimensional Linear Bandits with Knapsacks. (arXiv:2311.01327v1 [cs.LG])
    We study the contextual bandits with knapsack (CBwK) problem under the high-dimensional setting where the dimension of the feature is large. The reward of pulling each arm equals the multiplication of a sparse high-dimensional weight vector and the feature of the current arrival, with additional random noise. In this paper, we investigate how to exploit this sparsity structure to achieve improved regret for the CBwK problem. To this end, we first develop an online variant of the hard thresholding algorithm that performs the sparse estimation in an online manner. We further combine our online estimator with a primal-dual framework, where we assign a dual variable to each knapsack constraint and utilize an online learning algorithm to update the dual variable, thereby controlling the consumption of the knapsack capacity. We show that this integrated approach allows us to achieve a sublinear regret that depends logarithmically on the feature dimension, thus improving the polynomial dependency established in the previous literature. We also apply our framework to the high-dimension contextual bandit problem without the knapsack constraint and achieve optimal regret in both the data-poor regime and the data-rich regime. We finally conduct numerical experiments to show the efficient empirical performance of our algorithms under the high dimensional setting.  ( 2 min )
    Time-series Generation by Contrastive Imitation. (arXiv:2311.01388v1 [stat.ML])
    Consider learning a generative model for time-series data. The sequential setting poses a unique challenge: Not only should the generator capture the conditional dynamics of (stepwise) transitions, but its open-loop rollouts should also preserve the joint distribution of (multi-step) trajectories. On one hand, autoregressive models trained by MLE allow learning and computing explicit transition distributions, but suffer from compounding error during rollouts. On the other hand, adversarial models based on GAN training alleviate such exposure bias, but transitions are implicit and hard to assess. In this work, we study a generative framework that seeks to combine the strengths of both: Motivated by a moment-matching objective to mitigate compounding error, we optimize a local (but forward-looking) transition policy, where the reinforcement signal is provided by a global (but stepwise-decomposable) energy model trained by contrastive estimation. At training, the two components are learned cooperatively, avoiding the instabilities typical of adversarial objectives. At inference, the learned policy serves as the generator for iterative sampling, and the learned energy serves as a trajectory-level measure for evaluating sample quality. By expressly training a policy to imitate sequential behavior of time-series features in a dataset, this approach embodies "generation by imitation". Theoretically, we illustrate the correctness of this formulation and the consistency of the algorithm. Empirically, we evaluate its ability to generate predictively useful samples from real-world datasets, verifying that it performs at the standard of existing benchmarks.  ( 2 min )
    Bounding Wasserstein distance with couplings. (arXiv:2112.03152v3 [stat.CO] UPDATED)
    Markov chain Monte Carlo (MCMC) provides asymptotically consistent estimates of intractable posterior expectations as the number of iterations tends to infinity. However, in large data applications, MCMC can be computationally expensive per iteration. This has catalyzed interest in approximating MCMC in a manner that improves computational speed per iteration but does not produce asymptotically consistent estimates. In this article, we propose estimators based on couplings of Markov chains to assess the quality of such asymptotically biased sampling methods. The estimators give empirical upper bounds of the Wasserstein distance between the limiting distribution of the asymptotically biased sampling method and the original target distribution of interest. We establish theoretical guarantees for our upper bounds and show that our estimators can remain effective in high dimensions. We apply our quality measures to stochastic gradient MCMC, variational Bayes, and Laplace approximations for tall data and to approximate MCMC for Bayesian logistic regression in 4500 dimensions and Bayesian linear regression in 50000 dimensions.  ( 2 min )
    Data-Driven Model Selections of Second-Order Particle Dynamics via Integrating Gaussian Processes with Low-Dimensional Interacting Structures. (arXiv:2311.00902v1 [stat.ML])
    In this paper, we focus on the data-driven discovery of a general second-order particle-based model that contains many state-of-the-art models for modeling the aggregation and collective behavior of interacting agents of similar size and body type. This model takes the form of a high-dimensional system of ordinary differential equations parameterized by two interaction kernels that appraise the alignment of positions and velocities. We propose a Gaussian Process-based approach to this problem, where the unknown model parameters are marginalized by using two independent Gaussian Process (GP) priors on latent interaction kernels constrained to dynamics and observational data. This results in a nonparametric model for interacting dynamical systems that accounts for uncertainty quantification. We also develop acceleration techniques to improve scalability. Moreover, we perform a theoretical analysis to interpret the methodology and investigate the conditions under which the kernels can be recovered. We demonstrate the effectiveness of the proposed approach on various prototype systems, including the selection of the order of the systems and the types of interactions. In particular, we present applications to modeling two real-world fish motion datasets that display flocking and milling patterns up to 248 dimensions. Despite the use of small data sets, the GP-based approach learns an effective representation of the nonlinear dynamics in these spaces and outperforms competitor methods.  ( 3 min )

  • Open

    [Research] Which Python libraries do you recommend for label ranking?
    I'm currently looking for Python libraries that offer models for the label ranking problem: given a context x and a set of labels Y, the model should output a ranking of those labels. I'm mostly interested in models that implement ranking by pairwise comparison (RPC) and the Plackett-Luce model. I would be grateful for any hints. submitted by /u/Emergency_Caramel_69  ( 9 min )
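    For readers unfamiliar with the Plackett-Luce model named in the post, here is a minimal sketch of its ranking probability in plain Python. The per-label scores are hard-coded for illustration; in a label ranking model they would be produced from the context x by some learned function, which is not shown. All function names are hypothetical and do not belong to any particular library.

```python
import itertools
import math


def plackett_luce_log_prob(scores, ranking):
    """Log-probability of `ranking` (label indices, best first) given label scores."""
    remaining = list(ranking)
    log_prob = 0.0
    for label in ranking:
        # Probability of choosing `label` next among the labels not yet placed.
        log_norm = math.log(sum(math.exp(scores[j]) for j in remaining))
        log_prob += scores[label] - log_norm
        remaining.remove(label)
    return log_prob


def most_probable_ranking(scores):
    """Under Plackett-Luce the mode is the labels sorted by decreasing score."""
    return sorted(range(len(scores)), key=lambda j: -scores[j])


scores = [2.0, 0.5, 1.0]  # hypothetical scores for three labels
best = most_probable_ranking(scores)
total = sum(math.exp(plackett_luce_log_prob(scores, list(p)))
            for p in itertools.permutations(range(len(scores))))
```

    Summing the probabilities over all permutations gives 1 up to rounding, which is a quick sanity check on the formula.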
    [D] Tuning Models for Closed Book Q&A
    After transitioning from a different career path into ML/AI, I've embarked on a self-learning journey that's been incredibly rewarding yet challenging at times. I'd love to get your insights on a project I'm working on. Here's what I've done so far: dived into AI research papers and theoretical foundations; set up a personal server with high-end GPUs; conducted experiments with models like Llama2; implemented low-rank adapters in models for document generation; and trained transformers using pure PyTorch (steering clear of Hugging Face from now on). My current challenge is training a model that can perform Q&A on documents it has generated. While the Flan paper has provided some direction, practical application has proven to be complex, especially when balancing resource use and noise constraints (servers are very loud, and I'd like to minimize training time for this reason). I'm reaching out for: insights on training Llama (or similar models) for Q&A tasks on generated content; experiences or lessons learned from implementing advice from the Flan paper or similar research; and tips on efficient resource management during training (to alleviate time, noise, and power concerns). Your collective wisdom would be a beacon for a lone learner like myself. If there are foundational concepts I might be missing or resources I should consult, please point me in the right direction. Thank you for your time and help! submitted by /u/TheRealBracketMaster  ( 9 min )
    [D] [P] First time delving into Gen AI
    Hi, I need to make a PowerPoint presentation for a startup founder who might hire me. He owns a digital/social media marketing company that also does website development. I need to show the following things in the presentation: (1) Create an automated LDM model/service with very high accuracy (>90%) for UI/UX design. The goal is to minimize the role of graphic designers as much as possible. (2) Create an automated LLM model/service with very high accuracy (>95%) for onboarding/handling customers as well as customer support. Currently, the task is managed by a bunch of product managers and ChatGPT-powered chatbots. The goal is to minimize the role of these product managers as much as possible. Besides the above, if you guys have any other AI solutions that I could include, I'd be very grateful. The more valuable solutions I can pitch to the founder, the better my chances of getting a job in his company. Please help me, because I've never really played with Gen AI before. Please point me in the right direction 🙏 Thanks in advance! submitted by /u/master-killerrr
    [D] Label my own data for Fine-tuning OCR?
    Hi! I was wondering if anyone could give me a heads up on what I can do to label my own corpus of data to fine-tune an OCR model. An example of what I want to detect better is shown here https://files.catbox.moe/bnsb2i.png Right now stuff like Amazon Textract can't pick up the superscript footnote markers (the small 1,2,3) very accurately, and I'd like to manually label some data to hopefully improve this. This is actually a very common book format, and yeah I do realize it will take a lot of time but it's better than nothing I suppose. Thanks and God bless! submitted by /u/angel__-__- [link] [comments]  ( 9 min )
    [D] Does view function in PyTorch cause loss of information? I want to understand more about tensor dimension compression
    So let's say I have an encoded tensor of size [1024,4,20] which represents [number_of_people, 4 choices, embedding_dim]. This means each individual person has 4 choices on, let's say, food, and these differ across the entire population (which is now a batch). This tensor is already encoded with torch.nn.Embedding, where 20 is the dim size. I need to compress this into a tensor of size [4,20] so that all the "meal choices" information of each unique person is now a 2d tensor, since I need to multiply this matrix with another embedding. I'm still in my learning phase with PyTorch and trying to go through documents on dimension reduction (I'm familiar with all the reshaping etc., but this is the first time I actually have to really think and be careful since I might lose information if I don't do this correctly). Could someone explain to me why using certain things, like view, would not lose this specific information about each individual person, so that I can carry on and use this compressed "choices embedding" on other things with peace of mind that I'm doing it correctly? I guess this comes from my lack of understanding of linear algebra; any guidance on how to understand this topic deeply would be greatly appreciated. I was trying to get a resulting tensor of size [1024,4] but my scoring function gives a tensor of size [1024,1024,4] if I multiply [1024,20] with [1024,20,4] (permuted choices tensor). At this point I'm also not sure if I should reduce it earlier on with the original encoded tensor or as the last step on the resulting tensor. submitted by /u/parz01000101 [link] [comments]  ( 9 min )
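    The shapes in the question can be reproduced with a small sketch. It assumes PyTorch is installed and uses 8 people instead of 1024; the tensor names (choice_emb, person_emb) are made up for illustration and are not from the post. It shows that view only relabels the shape, that a plain matmul broadcasts into the unexpected [people, people, choices] result, and that pairing each person with their own choices gives [people, choices].

```python
import torch

people, choices, dim = 8, 4, 20   # small stand-ins for 1024, 4, 20

choice_emb = torch.randn(people, choices, dim)   # [people, choices, dim]
person_emb = torch.randn(people, dim)            # [people, dim]

# view only reinterprets the shape; every element is still there.
flat = choice_emb.view(people, choices * dim)
restored = flat.view(people, choices, dim)
assert torch.equal(restored, choice_emb)

# Plain matmul broadcasts the 2-D tensor against every batch entry,
# which is where the unexpected [people, people, choices] comes from.
crossed = torch.matmul(person_emb, choice_emb.permute(0, 2, 1))
print(crossed.shape)   # torch.Size([8, 8, 4])

# Pairing person i only with person i's own choices gives [people, choices].
scores = torch.einsum("pd,pcd->pc", person_emb, choice_emb)
print(scores.shape)    # torch.Size([8, 4])

# Equivalent batched form: [people, 1, dim] @ [people, dim, choices].
scores_bmm = torch.bmm(
    person_emb.unsqueeze(1), choice_emb.permute(0, 2, 1)
).squeeze(1)
```

    Collapsing [1024,4,20] all the way down to [4,20] is a different operation: it needs a reduction such as a mean over the first axis, and that does discard per-person detail, whereas view and reshape never do.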
    [D] What do you fine folks think of this - Causal AI ?
    Stanford blog post - https://ssir.org/articles/entry/the_case_for_causal_ai Found this other link that seems more ready - https://causalens.com/ What do you think? Next wonder or smokescreen? Update - Found causalens website - https://preview.redd.it/x8azqhw136yb1.png?width=527&format=png&auto=webp&s=2d0cb3e0139233b2e9f9598e10d6060a6a4105cb Makes me wonder why this cannot be implemented with an ML / DL model? It's human-designed input structured as a constraint... what do you think? submitted by /u/dpadhy [link] [comments]  ( 9 min )
    [D] Comparing RL and LLM Prompting for Modern Game Playing AI Systems (a video)
    Hey people, I wanted to share a video from my ML YouTube channel discussing the state of the art methods for game playing AI systems post the LLM boom. Of course, this space was dominated by Reinforcement Learning for most of the 2010s, but there has been some interesting work towards using LLMs solo or as an “RL assistant” to train better RL agents. Here’s my video breaking down the complex prompting systems that let LLMs like GPT4 play Minecraft-like open world games and reflect on their progress. Hope people who are interested find it worthwhile… https://youtu.be/cXfnNoMgCio submitted by /u/AvvYaa [link] [comments]  ( 9 min )
    [P] Vector quantization methods
    https://preview.redd.it/mq8y8as0q4yb1.jpg?width=1374&format=pjpg&auto=webp&s=41a28968c929bf5a44fab03bef67991319e5728b txtai is an all-in-one embeddings database for semantic search, LLM orchestration and language model workflows. txtai has built-in support for quantizing vectors. The code above shows how 1-bit (binary) quantization can be applied. With 1-bit quantization, each dimension is transformed into a 1-bit value (0 or 1). Those bits are grouped into uint8 values. This method can retain a surprising amount of accuracy, especially with high-dimensional models. See the article below for more information along with benchmarks. Article: https://neuml.hashnode.dev/all-about-vector-quantization GitHub: https://github.com/neuml/txtai submitted by /u/davidmezzetti [link] [comments]  ( 9 min )
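    The bit-packing idea described above can be sketched in plain Python. This is not txtai's code or API: the function names are made up for illustration, and thresholding at zero is an assumption about how each dimension becomes a bit.

```python
def quantize_1bit(vector):
    """Pack a float vector into bytes, one bit per dimension (1 if value > 0).

    Thresholding at zero is an assumption made for this sketch.
    """
    bits = [1 if x > 0 else 0 for x in vector]
    bits += [0] * (-len(bits) % 8)          # pad to a multiple of 8 bits
    packed = bytearray()
    for i in range(0, len(bits), 8):
        byte = 0
        for bit in bits[i:i + 8]:
            byte = (byte << 1) | bit        # first dimension is the high bit
        packed.append(byte)
    return bytes(packed)


def hamming(a, b):
    """Count differing bits between two packed vectors of equal length."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


v1 = [0.3, -1.2, 0.8, 0.1, -0.5, -0.7, 2.0, -0.1]
v2 = [0.2, -0.9, -0.4, 0.6, -0.3, 0.1, 1.5, -0.2]
q1, q2 = quantize_1bit(v1), quantize_1bit(v2)

print(list(q1), list(q2))   # [178] [150]
print(hamming(q1, q2))      # 2
```

    Eight float dimensions shrink to a single byte here, and similarity between packed vectors reduces to counting differing bits.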
    [D] Overfitting
    Assume that my training and testing sets have the same distribution. If my training performance is very good but the model is clearly overfitting, can I say that there MUST be a way to tune the model so that the testing performance will also improve? Or what other assumptions on my data do I have to make? submitted by /u/jef_107 [link] [comments]  ( 9 min )
    [D] Seeking Clarification on LoRA, Adapters, and Prefix Tuning in LLMs
    I've come across several "parameter-efficient" finetuning methods, LoRA (Low-Rank Adaptation), adapters, and prefix tuning, and I'm trying to understand the differences between them in the context of LLMs. I'm curious about the specific advantages and disadvantages of each method. For instance, how do the efficiency and performance of LoRA (which modifies a selected subset of parameters) compare to adapters and prefix tuning (which add a small number of parameters)? Are there any significant trade-offs to consider when choosing between these methods? I've come across lots of LLMs trained with LoRA, but I'm struggling to find models trained with adapters or prefix tuning. Any guidance on this would be greatly appreciated. Also, is it possible to use LoRA, adapters, and prefix tuning simultaneously in a single LLM? If so, are there any known benefits or drawbacks to this approach? Thank you in advance for your insights. submitted by /u/Distinct-Target7503 [link] [comments]  ( 9 min )
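    For context on the parameter counts being compared, here is a minimal pure-Python sketch of the usual LoRA formulation: the frozen weight W gets a low-rank update (alpha / r) * B @ A, and only A and B are trained. The sizes, values, and helper names are illustrative assumptions. In practice B usually starts at zero; it is non-zero here only so the merged result is visible.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def lora_merge(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A; W itself stays frozen during training."""
    r = len(a)                      # rank = number of rows of A
    delta = matmul(b, a)            # [d_out, d_in], same shape as W
    scale = alpha / r
    return [
        [w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
        for i in range(len(w))
    ]


d_out, d_in, r = 64, 64, 4
w = [[0.0] * d_in for _ in range(d_out)]     # frozen pretrained weight (zeros for clarity)
a = [[1.0] * d_in for _ in range(r)]         # trainable, [r, d_in]
b = [[0.5] * r for _ in range(d_out)]        # trainable, [d_out, r]

merged = lora_merge(w, a, b, alpha=2.0)

full_params = d_out * d_in                   # parameters touched by full finetuning
lora_params = r * (d_out + d_in)             # parameters trained by LoRA
print(full_params, lora_params)              # 4096 512
```

    Because the update has the same shape as W, it can be folded into the weight after training, which is one practical difference from adapter layers that remain in the forward pass.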
    [D] Making an autoencoder rotation invariant for image clustering?
    I'm trying to cluster PDF files that I've converted into images, and I've gotten a good suggestion to train an autoencoder with convolutional layers and cluster in the latent space. I'm hoping to implement this with Keras. The problem I'm running into is that these PDF files are scans, so some of the files are slightly rotated, and some of them are rotated by a full 90 degrees. As far as I know autoencoders are generally not rotation invariant, and all I was able to find online is a solution to a weird problem that involves 2d images of objects rotated in 3d. Is there a way to make an autoencoder that does have simple rotation invariance? submitted by /u/TeenColonistWrangler [link] [comments]  ( 9 min )
    [D] Data extraction with LLMs: JSON, CSV or …?
    I’ve been reading about a bunch of different methods for extracting structured data from text with LLMs (from docs, audio transcripts, etc). One approach is entity extraction into a connected knowledge graph (ie people, places, things). Another is providing a JSON schema to extract into, and outputting JSON. I’ve also seen table extraction and outputting CSV. 🙋‍♂️ If you’ve been using (or want to use) LLM data extraction in your workflows, which method have you been using (or are looking to use in future)? I’d be interested to learn what methods are needed for real apps, vs what’s just been used for one-off demos. Appreciate any insight! submitted by /u/DeadPukka [link] [comments]
    [D] About Scientific Machine Learning
    Hi everyone! I am new to SciML. When I read papers like Neural ODEs or Liquid time-constant Neural Network (LTC), there are both familiar and new principles in mathematics. I can use Google or ChatGPT to understand the new principles, but I am looking for books from which I can learn more and dive deep into the field of SciML. Can anyone suggest such books? Thank you! submitted by /u/luciffer_ [link] [comments]
    [D] Trying to remember the name of a famous paper...
    I'm trying to find a paper I read a while back. I believe I heard about it from this subreddit. It was old. Maybe even from the 50s or 60s. The way I remember it, it starts by discussing some general properties of entropy and then derives logistic regression as a maximum entropy model. It had sort of a physics/information theory flavor to it. At least that's how I remember it. Does that sound familiar to anyone? edit: Found it thanks to /u/TastyOs. "Information theory and statistical mechanics" by ET Jaynes (1957). https://journals.aps.org/pr/abstract/10.1103/PhysRev.106.620 Non paywall version: https://bayes.wustl.edu/etj/articles/theory.1.pdf Although now I feel like "Entropy shall be all that a Man requires" would have been a much better title. submitted by /u/12tone [link] [comments]  ( 9 min )
    [D] ViT model design
    Hello everyone, I have described my understanding of the ViT architecture in the diagram below. I did not include normalization, skip connections or multi-head self-attention. Otherwise, if you find any mistake please let me know. Further, I have some questions. 1- In self-attention, according to my understanding, we pass all 256 + 1 tokens (that is basically the whole image + 1 class token) to the 3 different linear layers, find the similarity between 2 of them, and then do element-wise multiplication with the third to increase or decrease the magnitude. Is this correct? 2- Do these linear layers in self-attention have any relationship between them, or are they 3 distinct layers? 3- Further, I don't understand the concept behind the class token: why are we using it? If we remove it and use all 256 tokens in the classification layer (which would make more sense), is that not possible? https://preview.redd.it/jcmfwx1x32yb1.jpg?width=1200&format=pjpg&auto=webp&s=1d3757feb06e7576e64c5725bd460c760eadf9af submitted by /u/NoEntertainment6225 [link] [comments]  ( 9 min )
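    For reference alongside question 1, here is a minimal single-head self-attention sketch. It assumes PyTorch; the width of 64 and the layer names (to_q, to_k, to_v) are illustrative and not taken from the post's diagram. The three projections are separate layers with their own weights, and the softmax-normalized similarities are combined with the values by a matrix product (a weighted sum over tokens) rather than element-wise.

```python
import math
import torch

torch.manual_seed(0)
tokens, dim = 257, 64   # 256 patch tokens + 1 class token; dim is an assumed width

x = torch.randn(tokens, dim)                      # token embeddings, class token at index 0

# Three independent projections; they share the input but not their weights.
to_q = torch.nn.Linear(dim, dim, bias=False)
to_k = torch.nn.Linear(dim, dim, bias=False)
to_v = torch.nn.Linear(dim, dim, bias=False)
q, k, v = to_q(x), to_k(x), to_v(x)

# Similarity of every token with every other token, scaled and normalized per row.
weights = torch.softmax(q @ k.T / math.sqrt(dim), dim=-1)   # [257, 257]

# Each output token is a weighted sum of all value vectors (matrix product).
out = weights @ v                                            # [257, 64]

cls_out = out[0]   # the class-token row, typically fed to the classification head
print(weights.shape, out.shape)
```

    The class token's output row already mixes information from all 256 patches through the attention weights, which is why a single vector can be passed to the classifier; pooling over the patch tokens instead is a common alternative.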
    [D] Model downloading speed is limited if not logged in to HuggingFace, help!
    When I directly click the download button on HuggingFace website repo: (Logged in) Download speed 10m/s (Not logged in) Download speed less than 500k/s I have searched over the entire internet and cannot find why. The problem is that, now I have to use git clone to download the model with command line in the linux server, however even if I use HuggingFace-cli and my token, the speed is still less than 500k/s, same as "not logged in". Does anyone have any idea? This problem is really confusing. submitted by /u/CindyIH [link] [comments]  ( 9 min )
    [R] Telling GPT-4 you're scared or under pressure improves performance
    In a recent paper, researchers have discovered that LLMs show enhanced performance when provided with prompts infused with emotional context, which they call "EmotionPrompts." These prompts incorporate sentiments of urgency or importance, such as "It's crucial that I get this right for my thesis defense," as opposed to neutral prompts like "Please provide feedback." The study's empirical evidence suggests substantial gains. This indicates a significant sensitivity of LLMs to the implied emotional stakes in a prompt: Deterministic tasks saw an 8% performance boost. Generative tasks experienced a 115% improvement when benchmarked using BIG-Bench. Human evaluators further validated these findings, observing a 10.9% increase in the perceived quality of responses when EmotionPrompts were used. This enhancement is attributed to the models' capacity to detect and prioritize the heightened language patterns that imply a need for precision and care in the response. The research delineates the potential of EmotionPrompts to refine the effectiveness of AI in applications where understanding the user's intent and urgency is paramount, even though the AI does not genuinely comprehend or feel emotions. TLDR: Research shows LLMs deliver better results when prompts signal emotional urgency. This insight can be leveraged to improve AI applications by integrating EmotionPrompts into the design of user interactions. Full summary is here. Paper here. submitted by /u/Successful-Western27 [link] [comments]  ( 9 min )
    [D] Recent discussion on X on doing a PhD vs working in industry
    This tweet (https://x.com/sshkhr16/status/1719721507872506090?s=46) gathered a lot of discussion with each side making a lot of good points. Would be great to know opinions of folks in this community. Would be really helpful especially to people deciding what to do after their undergrad maybe. I made a post a while before but I realized the title was kinda misleading so making a new one. submitted by /u/doppler_effects [link] [comments]  ( 9 min )
  • Open

    Dreamer on classic control like CartPole
    Has anyone ever seen a simple implementation of Dreamer on classic environments like CartPole? There are tons and tons of examples of policy gradient algorithms for CartPole and the like, but I haven't seen any results of Dreamer on a simple env like CartPole: a majority of Dreamer implementations (v1, v2, v3) focus on visual environments, and a few say that they are compatible with vector-only envs but don't provide any results or config to work with these. Typically, I'm looking for a simple implementation that doesn't use fancy tricks (like parallel processes, etc.) and works out of the box on a simple env like CartPole. Of course, I know that Dreamer isn't meant for such simple environments, but for educational purposes I think it's important to start with something as simple as possible. Thank you! submitted by /u/alexandretorres_ [link] [comments]
    Multi Agent PPO in Grid World environment using PettingZoo
    I'm trying to create a simple environment using PettingZoo's parallel API: a 12x12 grid with fixed obstacles. I'm trying to train 3 agents using PPO, with the goal of covering the grid entirely (an exploration/coverage task). Here's the entire colab notebook (with outputs) for the same: https://colab.research.google.com/drive/1yF4aRuQ0eZUIsboaoKZAx8JHgvHKLoof?usp=sharing Now, from what I see from the training statistics, the loss values are steadily decreasing and converging to zero, which suggests that the model is being trained properly. However, when I evaluate the learned policy, the results are not very good, with the average rewards of the 3 agents being 6.6, -6 and -6. Also, the environment was a pretty simple one, with just 2 obstacles. I tried testing my custom environment using PettingZoo's parallel API, and it threw an assertion error (the last cell). The problem could be that my environment is not properly formed. How can I debug this? And apart from this, what changes should I make to my PPO training loop? Changing the architecture is an obvious option but I want to make sure all the basic stuff is correct before doing that. A simple MLP policy should work on such a simple environment. submitted by /u/esem29 [link] [comments]
    CartPole equivalent environments in MARL?
    Hi all! I'm learning some MARL and in implementing some algorithms I'd like to test them on some simple environments. In the single agent setting people generally go to something like CartPole. Are there similar environments in MARL? For cooperative/zero-sum/general sum? submitted by /u/1cedrake [link] [comments]
    Parametrization of the Policy in Policy-based Methods
    Why is it the case that people mostly use neural networks to parameterize policies in policy-based methods, rather than probability distributions? Are there situations, where there is a stronger case to use the latter? submitted by /u/MomoSolar [link] [comments]
    "Transformers Learn Higher-Order Optimization Methods for In-Context Learning: A Study with Linear Models", Fu et al 2023 (self-attention learns higher-order gradient descent)
    submitted by /u/gwern [link] [comments]
  • Open

    I used Blender and Stable Diffusion to make image mixes between human and AI art.
    Title says it all. Mixing these two art worlds is quite fun btw (Spoiler note: I wasn't sure if the second image counts as suggestive due to the bottom clothing, So I marked it just incase. Also, If this gets removed because of the second pic, I completely get it XD) (Watermark note: I've made my reddit account before naming myself CappyAdams/YuriMayori. These images are still created by me and I'm happy to provide proof for those who don't believe me) https://preview.redd.it/a6e1m2zkv6yb1.png?width=3840&format=png&auto=webp&s=451ca2907c2227c844dece80630daf014c1f2272 https://preview.redd.it/he9f3utlv6yb1.png?width=3160&format=png&auto=webp&s=7ac81112f10066429847a23700a3af2268ba3e8d submitted by /u/SonicaNorth [link] [comments]
    What's the cheapest (but not free) AI chat app that can become your friend / girlfriend / family?
    Hi! I want to use an app that costs about $6 or less per month to make friends with their different characters. I don't want to pay more. I know there's many, but they're all above $6 / month. submitted by /u/Trainer_Red99 [link] [comments]
    AI one-percenters seizing power forever is the real doomsday scenario, warns AI godfather
    submitted by /u/donutloop [link] [comments]
    AI — weekly megathread!
    News provided by aibrews.com Luma AI introduced Genie, a generative 3D foundation model in research preview. It’s free during research preview via Discord [Details]. Nous Research released Obsidian, the world's first 3B multi-modal model family pre-trained for 4 Trillion tokens that runs locally on iPhones. Obsidian competes in benchmarks with WizardLM-13B and GPT4-X-Vicuna 13B and is based on CapybaraV1.9 [Details]. Phind has released a new model Phind Model V7 that matches and exceeds GPT-4's coding abilities while running 5x faster and having 16k context [Details]. Runway released an update for both text to video and image to video generation with Gen-2, bringing major improvements to both the fidelity and consistency of video results [Link]. Stability AI announced [Details]: Sta…
    Is Medicine going to turn into a job where you manage multiple AI/LLM tools to use as CoPilot
    What do you guys think is going to happen. I am a medical student and I have played around with LLMs a lot. Is medicine going to turn into this role where Doctor, Patient, LLMs (not just 1 but multiple agents) all work together for patient care? In the sense of what excel did for accountants, will LLMs do the same for doctors? Not just 1 LLM, but multiple LLM agents interfacing with each other as well working with a doctor in a symbiotic role. Doctors already spend a LOT of time in front of EHRs too. People say medicine will go back about being in person, but I feel like it will go in other direction and be EVEN more computer focused submitted by /u/derpgod123 [link] [comments]
    I made a website where you can ask the same question to GPT-2, GPT-3, ChatGPT and GPT-4, and compare the outputs
    submitted by /u/timegentlemenplease_ [link] [comments]
    Tomorrow a new Chat BOT competitor is coming for a select group of people!
    submitted by /u/Unreal_777 [link] [comments]
    Entering AI era, Taiwan chip industry urges renewables push
    submitted by /u/donutloop [link] [comments]
    One-Minute Daily AI News 11/3/2023
    Google today is launching a set of generative AI product imagery tools for advertisers in the U.S. Via the new, AI-powered Product Studio, merchants and advertisers will be able to leverage text-to-image AI capabilities to create new product imagery for free, simply by typing in a prompt of the image they want to use.[1] Ilya Sutskever, the co-founder and chief scientist of OpenAI, envisions a future where humans could merge with machines, and where machines might attain human-like intelligence.[2] Instagram has been spotted developing an “AI friend” feature that users would be able to customize to their liking and then converse with, according to screenshots shared by app researcher Alessandro Paluzzi. Users would be able to chat with the AI to “answer questions, talk through any challenges, brainstorm ideas and much more,” according to screenshots of the feature.[3] Mural on Wednesday released an integration with Microsoft 365 Copilot as well as Mural AI, its native generative AI tool.[4] Sources: [1] https://techcrunch.com/2023/11/01/google-launches-generative-ai-tools-for-product-imagery-to-u-s-advertisers/ [2] https://www.adgully.com/openai-s-ilya-sutskever-unlocks-ai-s-future-138365.html [3] https://techcrunch.com/2023/11/01/instagram-spotted-developing-a-customizable-ai-friend/ [4] https://www.techtarget.com/searchunifiedcommunications/news/366558012/Mural-intros-Mural-AI-integrates-with-Microsoft-365-Copilot submitted by /u/Excellent-Target-847 [link] [comments]
    Telling GPT-4 you're scared or under pressure improves performance
    In a recent paper, researchers have discovered that LLMs show enhanced performance when provided with prompts infused with emotional context, which they call "EmotionPrompts." These prompts incorporate sentiments of urgency or importance, such as "It's crucial that I get this right for my thesis defense," as opposed to neutral prompts like "Please provide feedback." The study's empirical evidence suggests substantial gains. This indicates a significant sensitivity of LLMs to the implied emotional stakes in a prompt: Deterministic tasks saw an 8% performance boost. Generative tasks experienced a 115% improvement when benchmarked using BIG-Bench. Human evaluators further validated these findings, observing a 10.9% increase in the perceived quality of responses when EmotionPrompts were used. This enhancement is attributed to the models' capacity to detect and prioritize the heightened language patterns that imply a need for precision and care in the response. The research delineates the potential of EmotionPrompts to refine the effectiveness of AI in applications where understanding the user's intent and urgency is paramount, even though the AI does not genuinely comprehend or feel emotions. TLDR: Research shows LLMs deliver better results when prompts signal emotional urgency. This insight can be leveraged to improve AI applications by integrating EmotionPrompts into the design of user interactions. Full summary is here. Paper here. submitted by /u/Successful-Western27 [link] [comments]
    Back propagation alternatives
    I understand that before back propagation was developed there were other methods used, such as Hebbian learning, and admittedly I know nothing about these old methods. But as I've learned about backprop, I'm wondering: is there a line of research working on alternatives? It seems amazing but also so highly incremental and blind that I wonder if there's a better way. One of its major drawbacks is the fact that the information must pass through the entire structure rather than getting immediate feedback. Anyway, thanks! submitted by /u/Stack3 [link] [comments]
  • Open

    Fax machines in the 21st century
    Tens of millions of fax machines still exist. My business line gets calls from modems and fax machines fairly often. Maybe my number is close to that of a fax machine. Fax machines and health care Fax machines are especially common in health care. I remember when I was working at MD […] Fax machines in the 21st century first appeared on John D. Cook.  ( 6 min )
    Blog RSS feed
    I got an email from someone saying the RSS feed for this site stopped working. Anyone else having this problem? I subscribe to my RSS feed and it’s working fine for me. It may be that there are variations on the RSS feed, and the version I’m using works while the variation some others use […] Blog RSS feed first appeared on John D. Cook.  ( 5 min )
    Solitons and the KdV equation
    Rarely does a nonlinear differential equation, especially a nonlinear partial differential equation, have a closed-form solution. But that is the case for the Korteweg–De Vries equation. (Technically I should say it’s rare for a naturally-occurring nonlinear differential equation to have a closed-form solution. You can always start with a solution and cook up a contrived […] Solitons and the KdV equation first appeared on John D. Cook.  ( 6 min )
  • Open

    Best of both worlds: Achieving scalability and quality in text clustering
    Posted by Sara Ahmadian and Mehran Kazemi, Research Scientists, Google Research Clustering is a fundamental, ubiquitous problem in data mining and unsupervised machine learning, where the goal is to group together similar items. The standard forms of clustering are metric clustering and graph clustering. In metric clustering, a given metric space defines distances between data points, which are grouped together based on their separation. In graph clustering, a given graph connects similar data points through edges, and the clustering process groups data points together based on the connections between them. Both clustering forms are particularly useful for large corpora where class labels can’t be defined. Examples of such corpora are the ever-growing digital text collections of variou…  ( 92 min )
  • Open

    ‘Starship for the Mind’: University of Florida Opens Malachowsky Hall, an Epicenter for AI and Data Science
    Embodying the convergence of AI and academia, the University of Florida Friday inaugurated the Malachowsky Hall for Data Science & Information Technology. The sleek, seven-story building is poised to play a pivotal role in UF’s ongoing efforts to harness the transformative power of AI, reaffirming its stature as one of the nation’s leading public universities. Read article >  ( 6 min )
    How AI-Based Cybersecurity Strengthens Business Resilience
    The world’s 5 billion internet users and nearly 54 billion devices generate 3.4 petabytes of data per second, according to IDC. As digitalization accelerates, enterprise IT teams are under greater pressure to identify and block incoming cyber threats to ensure business operations and services are not interrupted — and AI-based cybersecurity provides a reliable way Read article >  ( 11 min )
  • Open

    Bundesliga Match Facts Shot Speed – Who fires the hardest shots in the Bundesliga?
    There’s a kind of magic that surrounds a soccer shot so powerful, it leaves spectators, players, and even commentators in a momentary state of awe. Think back to a moment when the sheer force of a strike left an entire Bundesliga stadium buzzing with energy. What exactly captures our imagination with such intensity? While there […]  ( 10 min )

  • Open

    What are some of the coolest AI use cases you've tested so far? AI tutoring is something that I actually found myself using almost daily
    submitted by /u/Playdonifps [link] [comments]
    how can we be sure AI won't rebel against humans in the future?
    basically the title, how can we be sure AI won't have self awareness and won't rebel against humans? submitted by /u/lilshoegazecat [link] [comments]
    Same Images In Runway Gen 2 From 3 Months Ago VS Now (default options)
    submitted by /u/SuspiciousPillbox [link] [comments]
    Teen boys use AI to make fake nudes of classmates, sparking police probe
    Teen boys at Westfield High School in New Jersey used AI image generators to create and share fake nude photos of female classmates, sparking a police investigation. The school believed the images had been deleted, but it remains unclear how many students were affected or if any disciplinary action was taken. There is currently no federal law restricting the creation of faked sexual images, but some states have passed laws to outlaw the distribution of faked porn. President Joe Biden has issued an executive order urging lawmakers to pass protections against generative AI producing child sexual abuse material. New Jersey may strengthen its laws to criminalize the creation and sharing of AI-faked nudes. Source : https://arstechnica.com/tech-policy/2023/11/deepfake-nudes-of-high-schoolers-spark-police-probe-in-nj/ submitted by /u/NuseAI [link] [comments]
    text to 3D in 10 seconds (Mickey, Miney, Bart, Tom Holland, Cristiano Ronaldo?) (workflow in comments)
    submitted by /u/PeePeePeePooPooPooo [link] [comments]
    Benchmarking machine learning frameworks
    submitted by /u/mfilion [link] [comments]
    We don't want it ~
    submitted by /u/Unreal_777 [link] [comments]
    LinkedIn just launched a new AI job coach for Premium members
    submitted by /u/thisisinsider [link] [comments]
    Harmonizing the Future: AI, Music, and Crypto Revolution
    submitted by /u/Einsof__ [link] [comments]
    Searching for an internship in AI for my bachelor thesis
    I am a Belgian student currently studying applied informatics with a specialization in AI. We learn everything from machine learning to generative AI, and have a focus on integrating these into actual solutions. Next semester I am required to do an internship from March 25th till June 19th. During this internship I am also required to work on and write my bachelor thesis. The main problem now is that very few companies have contacted the school with internship positions related to AI. So I came here in the hope that some of you may know companies that are willing to offer an internship position, either in Belgium or at an international company offering remote work. My preference is for something in research or innovation, but I am open to doing any AI-related work. If it is something I have little experience in, I will learn. I will continue to search myself, but thanks in advance for any help! submitted by /u/ETS_Green [link] [comments]
    How can I generate accurate words and sentences in Midjourney?
    I’m currently using the pro version, though I’m extremely new to using it and have seen where you can add modifiers? Has anyone had any success with creating sentences or typography? If so do you have a method you use? submitted by /u/Maelasae [link] [comments]
    New Order - Blue Monday (AI music visualization)
    submitted by /u/glenniszen [link] [comments]
    Could Socratic Dialogue Evolve into a Hacking Technique for AI Systems?
    submitted by /u/utku1337 [link] [comments]
    Combination of each Star Wars trooper's helmet, from the Old Republic to the Final Order, into one
    submitted by /u/MomusVult [link] [comments]
    Convinced yet?
    submitted by /u/Philipp [link] [comments]
    What did humans lose by gaining intelligence?
    What did humans lose by gaining intelligence? submitted by /u/Virtual-Study-Campus [link] [comments]
    The art of color
    submitted by /u/Sea_Permit5660 [link] [comments]
    What is your approach to continuous testing and integration?
    If your answer is not below the given options, you can share in the comment section. I would appreciate your answers and suggestions. View Poll submitted by /u/Cygnet-Digital [link] [comments]
    Role Of AI In Business: Benefits And Challenges
    AI is not a far-off pipe dream anymore; it has already become a precious resource for companies, helping them save time and reduce costs. However, there is still widespread confusion about how to effectively use Artificial Intelligence in businesses. Many are contemplating how to harness this technology for innovation, scalability, and improvement. If you are among those thinking this, then there’s no reason to look further because, in this piece, you are going to learn about it. Ready to dig deeper into these AI trends and understand how they’ll shape your industry in 2024? Dive into our blog for comprehensive insights. 👉 https://invozone.com/blog/ai-in-business/ submitted by /u/InvoZone [link] [comments]
    AI on weight gain from the 1950s to today
    Here's a progression timeline of the obesity epidemic, with a focus on quantifying weight gain: 1950s-1960s: - Initial Changes: During this period, the average American adult gained approximately 10 pounds compared to their counterparts from the early 1900s. - TV's Sedentary Effect: Hours of TV watching correlated with a slight uptick in average body weight. 1970s: - Fast Food's Caloric Boom: Regular consumption added an estimated 200-300 extra calories per day to many individuals' diets, leading to potential weight gains of 20-30 pounds a year if not offset by exercise. - Shift in Work: The move to sedentary jobs meant many adults were burning 100-200 fewer calories per day, leading to an additional potential weight gain of 10-20 pounds a year. 1980s: - Processed Food Surge: The averag…
    What do you guys expect from the OpenAI developer conference on November 6?
    I would guess some API access stuff, nothing more. submitted by /u/Mission-Length7704 [link] [comments]
    One-Minute Daily AI News 11/2/2023
    Shopify (SHOP.TO) has to prove to investors that its AI products will spark growth when it reports results on Thursday. Wall Street is expecting it to show revenue growth of 22.38% to $1.67 billion compared to last year, according to estimates from LSEG.[1] At a U.K. summit, 28 governments, including China and the U.S., signed a declaration agreeing to cooperate on evaluating the risks of artificial intelligence.[2] AMD’s MI300 Chips Projected to Make $1 Billion in Sales, Challenging Nvidia’s Dominance.[3] Scarlett Johansson demands AI app stop using her likeness in an ad without her permission.[4] Sources: [1] https://www.reuters.com/business/retail-consumer/shopify-merchants-seek-ai-boost-key-sales-decisions-2023-11-01/ [2] https://www.nytimes.com/2023/11/01/world/europe/uk-ai-summit-sunak.html [3] https://gameishard.gg/news/amd-rises-as-ai-chip-sales-prediction-bodes-well-for-rivalry-with-nvidia-by-reuters/535905/ [4] https://www.nbcnews.com/tech/scarlett-johansson-legal-action-ai-app-rcna123248 submitted by /u/Excellent-Target-847 [link] [comments]
    text to 3D
    submitted by /u/PeePeePeePooPooPooo [link] [comments]
  • Open

    UK AI Safety Summit 2023: Scaling up the Future
    submitted by /u/engaged_ape [link] [comments]
  • Open

    Zero-shot adaptive prompting of large language models
    Posted by Xingchen Wan, Student Researcher, and Ruoxi Sun, Research Scientist, Cloud AI Team Recent advances in large language models (LLMs) are very promising as reflected in their capability for general problem-solving in few-shot and zero-shot setups, even without explicit training on these tasks. This is impressive because in the few-shot setup, LLMs are presented with only a few question-answer demonstrations prior to being given a test question. Even more challenging is the zero-shot setup, where the LLM is directly prompted with the test question only. Even though the few-shot setup has dramatically reduced the amount of data required to adapt a model for a specific use-case, there are still cases where generating sample prompts can be challenging. For example, handcrafting…  ( 93 min )
  • Open

    [R] Transformers Learn Higher-Order Optimization Methods for In-Context Learning: A Study with Linear Models
    Paper: https://arxiv.org/abs/2310.17086 Abstract: Transformers are remarkably good at in-context learning (ICL) -- learning from demonstrations without parameter updates -- but how they perform ICL remains a mystery. Recent work suggests that Transformers may learn in-context by internally running Gradient Descent, a first-order optimization method. In this paper, we instead demonstrate that Transformers learn to implement higher-order optimization methods to perform ICL. Focusing on in-context linear regression, we show that Transformers learn to implement an algorithm very similar to Iterative Newton's Method, a higher-order optimization method, rather than Gradient Descent. Empirically, we show that predictions from successive Transformer layers closely match different iterations of Newton's Method linearly, with each middle layer roughly computing 3 iterations. In contrast, exponentially more Gradient Descent steps are needed to match an additional Transformer layer; this suggests that Transformers have a comparable rate of convergence with high-order methods such as Iterative Newton, which are exponentially faster than Gradient Descent. We also show that Transformers can learn in-context on ill-conditioned data, a setting where Gradient Descent struggles but Iterative Newton succeeds. Finally, we show theoretical results which support our empirical findings and have a close correspondence with them: we prove that Transformers can implement k iterations of Newton's method with O(k) layers. https://preview.redd.it/i6hdcx1v60yb1.jpg?width=2036&format=pjpg&auto=webp&s=ed95fc0b625878ed88c3f36baa9ea3fb07430ff7 submitted by /u/APaperADay [link] [comments]  ( 9 min )
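    To make the abstract's "Iterative Newton's Method" concrete, here is a minimal sketch for linear regression. It assumes the Newton-Schulz form of the iteration for inverting S = X^T X, namely M_{k+1} = 2 M_k - M_k S M_k; the step size alpha = 1 / trace(S)^2, the helper names, and the toy data are illustrative choices, not taken from the paper.

```python
# Sketch only: Newton-Schulz iteration for S^{-1}, used to solve least squares.
# Assumes S = X^T X is positive definite; alpha and all names are hypothetical.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def newton_inverse(s, steps):
    """Approximate the inverse of s with M <- 2M - M S M."""
    n = len(s)
    trace = sum(s[i][i] for i in range(n))
    # For positive definite s, every eigenvalue is <= trace(s), so this
    # starting point M_0 = alpha * s keeps the iteration convergent.
    alpha = 1.0 / trace ** 2
    m = [[alpha * v for v in row] for row in s]
    for _ in range(steps):
        msm = matmul(matmul(m, s), m)
        m = [[2 * m[i][j] - msm[i][j] for j in range(n)] for i in range(n)]
    return m

def newton_least_squares(x, y, steps=12):
    """Return w_k = M_k X^T y after `steps` Newton iterations."""
    xt = transpose(x)
    m = newton_inverse(matmul(xt, x), steps)
    xty = matmul(xt, [[v] for v in y])
    return [row[0] for row in matmul(m, xty)]

# Toy data generated from w_true = [2, -1] with no noise.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [2.0, -1.0, 1.0]
w = newton_least_squares(X, y)
print(w)  # close to [2.0, -1.0]
```

    The error of this iteration squares at every step, which is the sense in which such methods converge exponentially faster than Gradient Descent on this problem.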
    [P] Get alerts when your AI fails (and get a gift card too!)
    Hey folks, I’m working on a platform that allows you to set up meaningful, automated tests and experiment tracking for your LLMs in just a few minutes (our tests go well beyond just the usual latency, token usage & cost stuff). If you’ve tried shipping LLMs and haven’t run into any issues with performance or trustworthiness, leave a comment on what your use-case is / how you did it! If not, here’s a free sign-up link to our app: https://app.openlayer.com. P.S. Will send a $50 Amazon gift card your way if you’re interested in giving me additional feedback afterwards (30 min call). Just send me an email at gabriel@openlayer.com. submitted by /u/byebaybay [link] [comments]  ( 9 min )
    [P] Benchmarking machine learning frameworks
    MLBench enables developers and maintainers to effortlessly gauge how their frameworks perform compared to other implementations, prior code versions, or across different boards, with respect to both runtime performance and other metrics. https://www.collabora.com/news-and-blog/news-and-events/benchmarking-machine-learning-frameworks.html submitted by /u/mfilion [link] [comments]  ( 9 min )
    [D] One LLM won't rule them all
    There isn't going to be one LLM to rule them all and here's why: https://generatingconversation.substack.com/p/one-llm-wont-rule-them-all submitted by /u/cgwuaqueduct [link] [comments]  ( 9 min )
    [D] Can somebody share their experience of attending ICML conference?
    I'm planning to attend ICML 2024 in person. Can somebody share their experience of attending the conference? Is it worth attending if you don't have any paper to present? If yes, how to get the most out of it? submitted by /u/cpluscplus [link] [comments]  ( 9 min )
    [D] Could it be advantageous to train/finetune some model (actually a LLM) feature by feature?
    This may be a weird question, but I have tried looking for resources and haven't found anything at all. We are trying to train a classifier from some data using a pretrained LLM. The data consist of several text features, so we decided at some point to concatenate them into a single string with the proper connectors and use the whole string as input. For instance, think about the data of some product: "name", "manufacturer", "specs", etc. Then we create a string such as "The product televisor from this manufacturer which have the following specs: ...". In this case the problem would be to decide the category of the product. For our particular case, our model makes some critical mistakes, and they seem to stem from not noticing that some features are critical (following the previous example, the manufacturer for instance). We were wondering if it would be beneficial to first train the model using only the substring corresponding to the feature that we think is most important, and then add features to the training little by little. I am a bit worried that this may lead to a bad local minimum where the model gets stuck. Has anybody seen or done anything similar, or does anyone have a reason why this would or wouldn't work? submitted by /u/soloetc [link] [comments]  ( 9 min )
    [R] GRACE: Discriminator-Guided Chain-of-Thought Reasoning
    TLDR: The paper proposes a decoding approach that improves multi-step (chain-of-thought) reasoning by using a discriminator to score and guide the generation of correct reasoning steps. It outperforms self-consistency and verifiers on various tasks and enhances both final answer accuracy and intermediate reasoning correctness. Paper: arxiv.org/abs/2305.14934 Code: https://github.com/mukhal/grace/ submitted by /u/moyle [link] [comments]  ( 9 min )
    [D] From i5 10400 to i5 11400 or Another monitor for dual monitor set up
    My current CPU is an i5 10400. For ML, will the i5 10400 bottleneck an RTX 3060 12GB? I heard it will because that i5 supports only PCIe Gen 3. Should I upgrade to the i5 11400 just to get PCIe Gen 4 support, or should I instead buy another monitor for a dual-monitor setup? submitted by /u/speed-speed [link] [comments]  ( 9 min )
    [D] benefits of using only attention weights for LoRA
    I'm confused as to why people would only use a subset of weights as learnable parameters for LoRA. If you are only using attention weights as update params, you still need the decomposed weights for the other layers to get the derivative of the loss with respect to the attention weights. That's how the chain rule works, so I don't see how it would help with memory consumption. Is there something I'm missing here? submitted by /u/skelly0311 [link] [comments]  ( 9 min )
    [D] Unsupervised feature selection
    Hello, I am doing a simulation study to see how associations change when aggregating spatial data. I have a large number of continuous exposure variables to choose from (>100). Before creating my models, I want to choose a group of meaningful exposures that limit collinearity (maybe ~10). The outcome variable will be simulated, and I want to choose the exposures without regard to the outcome (therefore I believe this is unsupervised learning). What is the best way to select these features? How many features should I select? Thank you! submitted by /u/DefinitelyAmNotOP [link] [comments]  ( 9 min )
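    One simple outcome-free approach to the question above is a greedy correlation filter: walk through the candidate exposures and keep each one only if its absolute Pearson correlation with every exposure already kept stays under a threshold. The sketch below is only an illustration; the 0.7 threshold, the cap of 10 features, the function names, and the toy columns are all assumptions, and the result depends on the order in which columns are visited.

```python
import math

def pearson(a, b):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def select_low_collinearity(columns, max_abs_corr=0.7, max_features=10):
    """Greedily keep columns that are weakly correlated with those kept so far.

    `columns` maps a feature name to its list of values. The outcome variable
    is never consulted, so the selection is unsupervised.
    """
    selected = []
    for name, values in columns.items():
        if len(set(values)) < 2:
            continue  # constant columns carry no information; skip them
        if all(abs(pearson(values, columns[kept])) <= max_abs_corr
               for kept in selected):
            selected.append(name)
        if len(selected) == max_features:
            break
    return selected

# Hypothetical toy exposures: "b" is nearly a rescaled copy of "a".
exposures = {
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.5],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
}
chosen = select_low_collinearity(exposures)
print(chosen)  # ['a', 'c']
```

    Sorting the columns beforehand, for example by variance or by domain relevance, is one way to make the visiting order a deliberate choice rather than an accident of the input.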
    [D] Thoughts on Masked Language Modeling Objective and Corrupted Spans for Causal LMs?
    Google has published a number of papers showing their increasing affinity for what was originally called a masked language modeling objective and has now become corrupted spans. I first saw this in the T5 paper, later in the UL2 paper, "What Language Model Architecture and Pretraining Objective Work Best for Zero-Shot Generalization?", and finally in "Transcending Scaling Laws with 0.1% Extra Compute". However, I don't see this mentioned in this sub or really in any discussions on LLM training, nor do I find documentation for it in any LLM training frameworks. Is this something any of you have used? Are you aware of any open source tools or code examples of this? One thing that confuses me is that these papers all discuss Prefix-LMs as being important in this process, though I'm unsure what they mean. I know a Prefix-LM has a portion of its inputs with bi-directional attention, but it's not something any actual models do as far as I can tell. But in the "Transcending Scaling Laws" paper it seems like they take PaLM training checkpoints and finish off the pre-training with their corrupted span method by converting PaLM to a Prefix-LM. They say: "We train U-PaLM using the prefix language model (PrefixLM) architecture, also sometimes known as a non-causal decoder-only model. The PrefixLM architecture keeps a non-causal mask in its prefix (or inputs) and applies bidirectional attention to input tokens." Is it really possible to take a causal LM and turn it into a prefix LM (partially bi-directional) by just changing the attention mask with no impact on the learned weights? Or do they mean they are training on completions only, as described in the Hugging Face documentation? submitted by /u/elbiot [link] [comments]  ( 9 min )
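    The difference between a causal mask and the PrefixLM mask described in the quote can be sketched in a few lines. This is only an illustration of the masking pattern, with hypothetical function names; it says nothing about how U-PaLM is actually implemented. Entry [i][j] is True when position i may attend to position j.

```python
def causal_mask(n):
    """Standard decoder mask: each position sees itself and earlier positions."""
    return [[j <= i for j in range(n)] for i in range(n)]

def prefix_lm_mask(n, prefix_len):
    """PrefixLM mask: the first `prefix_len` positions attend to each other
    bidirectionally; later positions stay causal but see the whole prefix."""
    return [[j <= i or j < prefix_len for j in range(n)] for i in range(n)]

def show(mask):
    """Render a mask as rows of 1s and 0s for quick inspection."""
    return ["".join("1" if allowed else "0" for allowed in row) for row in mask]

print(show(causal_mask(4)))        # ['1000', '1100', '1110', '1111']
print(show(prefix_lm_mask(4, 2)))  # ['1100', '1100', '1110', '1111']
```

    The two masks differ only inside the prefix block, so the same weight tensors fit either architecture; whether the learned weights still behave well under the new mask without further training is the open question the post raises.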
    [D] Imputing the Testing data after combining the Training and Validation datasets
    Hello, I imputed my training data's missing values with the mean of each column, but my question is, after combining the training and validation datasets and re-training the model, when we go on to test the model on the testing dataset, should we impute the testing data's missing values with the old training data's mean values, or the newly combined (train and validation) dataset's means? submitted by /u/CrunchyMind [link] [comments]  ( 9 min )
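    One common convention for the question above is to recompute the imputation statistics on exactly the data the final model is fit on (here, training plus validation) and to apply those same values to the test set, never computing anything from the test set itself. A minimal sketch, assuming missing values are encoded as None and every column has at least one observed value; the function names and toy rows are illustrative.

```python
def column_means(rows):
    """Mean of each column, ignoring None entries."""
    means = []
    for c in range(len(rows[0])):
        observed = [row[c] for row in rows if row[c] is not None]
        means.append(sum(observed) / len(observed))
    return means

def impute(rows, means):
    """Replace None in each column with that column's stored mean."""
    return [[means[c] if value is None else value
             for c, value in enumerate(row)] for row in rows]

train = [[1.0, None], [3.0, 4.0]]
val = [[None, 6.0], [5.0, 8.0]]
test = [[None, None], [2.0, 1.0]]

# Statistics come from the combined data used to refit the final model.
combined = train + val
means = column_means(combined)

final_training_data = impute(combined, means)
test_imputed = impute(test, means)  # test rows never influence `means`

print(means)         # [3.0, 6.0]
print(test_imputed)  # [[3.0, 6.0], [2.0, 1.0]]
```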
    [Research] Detecting Annotation Errors in Semantic Segmentation Data
    Would you trust medical AI that’s been trained on pathology/radiology images where tumors/injuries were overlooked by data annotators or otherwise mislabeled? Most image segmentation datasets today contain tons of errors because it is painstaking to annotate every pixel. Example of bone shard not labeled properly. After substantial research, I'm excited to introduce support for segmentation in cleanlab to automatically catch annotation errors in image segmentation datasets, before they harm your models! Quickly use this new addition to detect bad data and fix it before training/evaluating your segmentation models. This is the easiest way to increase the reliability of your data & AI! We have freely open-sourced our new method for improving segmentation data, published a paper on the research behind it, and released a 5-min code tutorial. You can also read more in the blog if you'd like. submitted by /u/cmauck10 [link] [comments]  ( 9 min )
    [D] AAAI 2024 Reviews
    Reviews are out on CMT. What did you get? submitted by /u/TheTeoz [link] [comments]  ( 8 min )
    [R] help with RFE in R
    Hi everyone, I’m working on a bioinformatics project for school and I’m having some trouble doing the RFE in R. I’ve got an Affymetrix gene expression set that I’ve done preprocessing on and I’m at the point where I’m ready to do RFE, but I’m struggling to write the code for this (my first iteration took 30 hours to run and I still screwed it up… it gave me my sample names as top variables instead of my probe sets). With that being said, the goal is to mine a biomarker from the Affymetrix data through SVM and RF RFE. I’d also like to investigate different biomarker sizes (2 probe sets, 50, 100, all the way up to the max number of probes I have). I’m a biochemist by education who got thrown into R and am struggling to learn it. I will say, it has been kinda fun tho. Thank you in advance, if any of you are in the central CT, USA region I’ll buy ya a beer! - rusty submitted by /u/RustyShackleford2677 [link] [comments]  ( 9 min )
    [D] A question about Nvidia NeMo + Getty Images
    Getty is pitching their new Generative AI tool in our company. It's based on Nvidia NeMo, and their "unique selling point" is that the model is fully trained with Getty images. This sounds a bit odd to me, but I'm not sure if it is actually possible to have a model trained only with proprietary images. The legal team accepts this as a true statement, but I still think this is some sort of finetuned version of an NVIDIA NeMo model. Does any of you have a clue where to look to make sure we are not being a bit too gullible? Thank you in advance. submitted by /u/legado [link] [comments]  ( 9 min )
    [D] Open Source Machine Learning Projects
    Hey, I consider myself a beginner in ML, and I was wondering if there are open source projects I could contribute to, to apply my knowledge and get real-world experience of how these things are built. The only projects I could find are labs or libraries and not products used in real life. Does anyone have any experience contributing to open source projects? submitted by /u/parvpareek [link] [comments]  ( 9 min )
    [R] Facebook Research archived hand-related repos
    I find that Facebook Research recently archived (at least some of) their repos about hand-related research, even very recent research: https://github.com/facebookresearch/PressureVision Or very popular ones: https://github.com/facebookresearch/ContactPose Does anybody have any idea why this happened? Or is it a mistake? submitted by /u/Crow-Scare [link] [comments]  ( 9 min )
    [D] Tools for creating ASR datasets
    Please can anyone recommend tools for creating ASR datasets? submitted by /u/afrodata [link] [comments]  ( 9 min )
    Separate model for outliers [R] [outliers] [regression]
    Hey. Here is the situation I found myself in. There is a dataset with around 4K observations, with nearly 10% outliers in the target variable. Transformations like log, Box-Cox, and winsorizing didn't work out. Robust regression approaches didn't help either. Model performance metrics are 4x worse with those outliers than without them. Note: just removing those observations is not an option. Here is my plan: build a model aimed at detecting the outliers first, then one model for "regular" observations and one model for outliers. Does it make sense, or am I overcomplicating? What is the common approach in such cases? Any help and ideas are highly appreciated. Thanks in advance submitted by /u/No_Purchase8883 [link] [comments]  ( 9 min )
    [D] AAAI 24 Reviews
    Creating this thread in anticipation of the upcoming Phase 2 reviews and results. If there already is a thread, please share! submitted by /u/tallguyfromstats [link] [comments]
    [D] Career Path for pursuing job in Machine Learning and Operations Research
    I am a final-year Industrial Engineering Master's student at Purdue University. My major areas of interest are Machine Learning and Operations Research. I have the option to pursue a dual Master's, with the other Master's in Electrical and Computer Engineering. Should I choose that path to get better job opportunities? Will the other degree help me in the long term? submitted by /u/pulkit_mundra [link] [comments]  ( 9 min )
  • Open

    2023-24 Takeda Fellows: Advancing research at the intersection of AI and health
    Thirteen new graduate student fellows will pursue exciting new paths of knowledge and discovery.  ( 14 min )
    Generating opportunities with generative AI
    Rama Ramakrishnan helps companies explore the promises and perils of large language models and other transformative AI technologies.  ( 10 min )
  • Open

    Deploy ML models built in Amazon SageMaker Canvas to Amazon SageMaker real-time endpoints
    Amazon SageMaker Canvas now supports deploying machine learning (ML) models to real-time inferencing endpoints, allowing you to take your ML models to production and drive action based on ML-powered insights. SageMaker Canvas is a no-code workspace that enables analysts and citizen data scientists to generate accurate ML predictions for their business needs. Until now, SageMaker Canvas […]  ( 6 min )
    Develop generative AI applications to improve teaching and learning experiences
    Recently, teachers and institutions have looked for different ways to incorporate artificial intelligence (AI) into their curriculums, whether it be teaching about machine learning (ML) or incorporating it into creating lesson plans, grading, or other educational applications. Generative AI models, in particular large language models (LLMs), have dramatically sped up AI’s impact on education. Generative […]  ( 8 min )
  • Open

    How Are Foundation Models Used in Gaming?
    AI technologies are having a massive impact across industries, including media and entertainment, automotive, customer service and more.  ( 8 min )
    GeForce NOW-vember Brings Over 50 New Games to Stream In the Cloud
    Gear up with gratitude for more gaming time. GeForce NOW brings members a cornucopia of 15 newly supported games to the cloud this week. That’s just the start — there are a total of 54 titles coming in the month of November. Members can also join thousands of esports fans in the cloud with the […]  ( 8 min )
  • Open

    What architecture for vision-based RL?
    Hello dear community, Someone has just asked me this question and I have been unable to provide a satisfactory answer, as in practice I have been using very simple and quite naive CNNs for this setting thus far. I think I read a couple papers a while back that were advocating for specific types of NNs to deal with vision-based RL specifically, but I forgot. So, my question is: what are the most promising NN architectures for pure vision-based (end-to-end) RL according to you? Thanks :) submitted by /u/yannbouteiller [link] [comments]
    Comparing RL vs LLM Prompting for Game Playing AI
    Hey people, I wanted to share a video from my ML YouTube channel discussing the state of the art methods for game playing AI systems post the LLM boom. Of course, this space was dominated by Reinforcement Learning for most of the 2010s, but there has been some interesting work towards using LLMs solo or as an “RL assistant” to train better RL agents. Some of the papers I talked about in the video seem to indicate that LLMs can guide RL exploration at the start of training to drastically improve sample efficiency. Here’s my video breaking down the complex prompting systems that let LLMs like GPT4 play Minecraft-like open world games and reflect on their progress. Hope people who are interested find it worthwhile… submitted by /u/AvvYaa [link] [comments]
  • Open

    A disk around Paris
    The other day I saw an image of a large disk centered on Paris subjected to the Mercator projection. I was playing around in Mathematica and made similar images for different projections. Each image below is a disk of radius 4200 km centered on Paris (latitude 49°, longitude 2°). All images were produced with the […] A disk around Paris first appeared on John D. Cook.  ( 5 min )
  • Open

    A Video Game that Pays: Lessons Learned from Working Remotely
    The original tweet by Sahil Lavingia Some time ago, I stumbled upon this amusing tweet from @shl. I love how true this statement is, although it feels very wrong to admit it. Let’s think about the day in the life of a software engineer: You interact with people from all corners of the world. You complete tasks on different online platforms (Slack, GitHub, VS Code, Google Docs, etc.). You have a list of your main quests to complete (you can find them in your journ… Kanban board!) There are also side-quests to take care of (“Hey Damian, could you take a look at this bug?”). Sometimes you can gang up with your teammates to slay a big beast (“Hey, I have this nasty bug that I have been working on for the past few days. I know you have better knowledge of this particular part of the repositor…  ( 8 min )

  • Open

    [D] Self Attention from First Principles (A video)
    Hello guys, I wanted to post a video I have been working on for my Deep Learning YT channel about Self Attention and Masked Self Attention. In the video, I tried to explain the essence of self-attention in an intuitive manner, describe how it works in practice, why it works how it works, its various strengths and applications… I’m kinda excited to share the video here for those that are interested. https://youtu.be/4naXLhVfeho submitted by /u/AvvYaa [link] [comments]  ( 9 min )
    [R] Zephyr: Direct Distillation of LM Alignment - state-of-the-art for 7B parameter chat models
    Paper: https://arxiv.org/abs/2310.16944 GitHub: https://github.com/huggingface/alignment-handbook Hugging Face: https://huggingface.co/collections/HuggingFaceH4/zephyr-7b-6538c6d6d5ddd1cbb1744a66 X thread: https://twitter.com/Thom_Wolf/status/1717821614467739796 Abstract: We aim to produce a smaller language model that is aligned to user intent. Previous research has shown that applying distilled supervised fine-tuning (dSFT) on larger models significantly improves task accuracy; however, these models are unaligned, i.e. they do not respond well to natural prompts. To distill this property, we experiment with the use of preference data from AI Feedback (AIF). Starting from a dataset of outputs ranked by a teacher model, we apply distilled direct preference optimization (dDPO) to learn a chat model with significantly improved intent alignment. The approach requires only a few hours of training without any additional sampling during fine-tuning. The final result, Zephyr-7B, sets the state-of-the-art on chat benchmarks for 7B parameter models, and requires no human annotation. In particular, results on MT-Bench show that Zephyr-7B surpasses Llama2-Chat-70B, the best open-access RLHF-based model. Code, models, data, and tutorials for the system are available at this https URL. https://preview.redd.it/4y355lxv0txb1.jpg?width=1200&format=pjpg&auto=webp&s=76e7b8a2ff06e39e9189712a42b1e349423b5d3d submitted by /u/APaperADay [link] [comments]  ( 9 min )
    [D] how to understand structural equivalence discovered by Node2Vec?
    Hi, I am new to graph neural networks and have some trouble understanding the structural equivalence discovered by Node2Vec. For instance, consider the following plot generated by a Node2Vec visualization (taken from https://towardsdatascience.com/complete-guide-to-understanding-node2vec-algorithm-4e9a35e5d147): https://preview.redd.it/kqjiiywelsxb1.png?width=500&format=png&auto=webp&s=6ad9d86f6f0a6dbe1a5d641272a5956d110c8770 They claim that if we encourage BFS in Node2Vec, then we are more likely to discover the structural equivalence patterns in the bottom plot. What I do not quite get is this: if we encourage BFS, wouldn't the discovered embeddings be more "local"? If so, why would those distant nodes have similar colors/embeddings? Any idea would be much appreciated, thanks! submitted by /u/Illustrious-Pay-7516 [link] [comments]  ( 9 min )
    [D] Need Help About Height Map Analysis
    As mentioned in the title, I need help with height map analysis. I want to create an artificial intelligence that can analyze a height map given as input and mark areas such as mountainous areas, river beds, plains, rocks, cliffs and other geographical details on the map. Is there any advice you can give me, or could you provide suggestions that will help me move forward? I am really interested in this project and want to work on it. Also, if you know any examples and/or studies related to this subject, can you please share them? I am looking forward to discovering more information regarding this topic. Edit: Spelling errors submitted by /u/PlayerWell [link] [comments]  ( 9 min )
    [D] AI that can remove emojis/Spider-Man filter from face
    Recently I saw an AI that claimed it could remove emojis from faces, which was obviously fake, but it made me curious: is there an AI that can do this? I think AI can't do it because the face does not exist in the image, so the AI has no way of showing the face behind the emoji. submitted by /u/randomaccimade69 [link] [comments]  ( 9 min )
    [D] With LLMs' hallucinating nature, how do we create a credible production-ready application?
    I want to use LLMs to automate analysing data and use it to provide insights to my users, but oftentimes I notice insights being generated on factually incorrect data. I tried fine-tuning my prompts, changing the structure in which I pass data to the LLM, and few-shot learning, but there is still some chance of it hallucinating. How can I create a production-ready application where these insights are surfaced to end users and presenting incorrect insights is not acceptable? I am out of ideas. Any guidance is appreciated 🙏🏻 submitted by /u/software-n-erd [link] [comments]  ( 9 min )
    [D] Professionally, is data collection carried out within a notebook?
    I am building my first model and I am about to start the data collection process, building a CSV dataset, but I do not know whether to do this within a Python script, the Jupyter notebook where I will write the model, or a separate notebook. I have tried researching this but I have not found a concise answer to my question. Thanks in advance. submitted by /u/obvslynot [link] [comments]  ( 9 min )
    [D] What do y'all think about Biden's new AI regulation?
    Would it stifle open-source development and new AI startups and only benefit the established big tech companies? It's kinda vague and I couldn't understand it on my first read, but does the executive order lack teeth? That is, if the guidelines are not followed closely, can the government do anything to the open-source community or new startups? Link to one of the articles (no paywall): https://www.reuters.com/technology/white-house-unveils-wide-ranging-action-mitigate-ai-risks-2023-10-30/ submitted by /u/ColumbiaGSAlum [link] [comments]  ( 9 min )
    [P] LLM-VM: Fine-tune anywhere & avoid big models.
    I wanted to share a project I’ve been working on, LLM-VM. It’s a community-first, open-source tool designed to enhance the efficiency of fine-tuning and inference for large language models (LLMs) both locally and in cloud environments. At its core, LLM-VM implements recursive synthesized distillation with automatic task discovery. This means it can iteratively refine training data and model parameters, aiming to optimize model performance with less computational overhead. Our goal with LLM-VM is to provide a practical and accessible platform for researchers and developers. By facilitating more efficient model training and deployment, we hope to contribute to the broader machine learning community’s efforts in advancing language model capabilities. I’d love to get your feedback, contributions, or any thoughts you might have. Let’s collaborate to push the boundaries of what we can achieve with LLMs! Cheers! submitted by /u/mmirman [link] [comments]  ( 9 min )
    [P] LangCheck: a multi-lingual toolkit to evaluate LLM applications
    Hi! I wanted to share LangCheck, an open source toolkit to evaluate LLM applications (GitHub, Quickstart). It already supports English and Japanese text, and more languages soon – contributions welcome! Core functionality: langcheck.metrics – metrics to evaluate quality & structure of LLM-generated text; langcheck.plot – interactive visualizations of text quality; langcheck.augment – text augmentations to perturb prompts, references, etc (coming soon). Super open to feedback & curious how other people think about evaluation for LLM apps. submitted by /u/kennysong [link] [comments]  ( 9 min )
    [D] Data Pipelines for Data Products
    Data pipelines are one of the key components of an ML product. Creating value from different resources only makes sense when it is available to the consumers. In the article, you will explore the most important elements of a data pipeline that fulfils the data product needs, and you will get practical guidelines to incorporate in your use cases. Here's the article: https://moderndata101.substack.com/p/data-pipelines-for-data-products submitted by /u/growth_man [link] [comments]  ( 9 min )
    [Research], [R], [Project], [P] Datasets for evaluating algorithms on sensor calibration
    Hi everyone. I need to evaluate the precision and accuracy of some machine learning algorithms for calibrating a sensor. In particular, I need to compare these algorithms and choose the best one to calibrate a sensor, and then obtain a transfer function. I did various searches on known sites and repositories but found very little. In particular, I would need datasets from 3 different sensors, in order to test the various algorithms on different sensors. Each dataset must therefore contain the data collected by the sensor and the target data for that sensor, so that it can be calibrated and the calibration algorithms can be evaluated. Can you help me? submitted by /u/Calosss22 [link] [comments]  ( 9 min )
    [N] Webinar - Enable and manage Vector Search in MongoDB Atlas with SuperDuperDB
    Hi Community! Today we are hosting a hands-on webinar, "Enable and manage Vector Search in MongoDB Atlas with SuperDuperDB": https://www.eventbrite.com/e/enable-and-manage-vector-search-in-mongodb-atlas-with-superduperdb-webinar-tickets-744936223297 The following questions will be answered in the workshop: · What is vector search and why is it so important? · What are vector databases? · Why is it a huge advantage to use vector search with MongoDB Atlas instead of a vector database? · What embedding models are there? · How do I use these models to generate vector embeddings for my data? · How do I perform vector search? · What AI applications can I build on top of vector search? When? Wednesday, November 1st 12pm - 1pm ET (Eastern Standard Time) submitted by /u/Sevyten [link] [comments]  ( 9 min )
    [D] Rewriting in many styles using AI
    Hi everyone. Recently I have done some research on building an AI writing tool such as Quillbot. I see that such tools do really well at paraphrasing. I am currently reading articles and trying to learn how to create a deep learning model that can switch between many different writing styles and let users customize them. I started with small models and a few pretrained models. I find that models are either trained to specialize in only one writing style or simply rewrite without regard to a specific style. Next, I tried LLMs like Mistral 7B and Llama-2, and by my own assessment the results were somewhat better. However, I want more than that. Is there a way to produce many different writing styles using only one model, in a way that scales to more styles? Has anyone applied this to real products? If you can, could you suggest some more related research? submitted by /u/HughLee_1999 [link] [comments]  ( 9 min )
    [R] LLMs may Dominate Information Access: Neural Retrievers are Biased Towards LLM-Generated Texts
    [arXiv] https://arxiv.org/abs/2310.20501 [Abstract] Recently, the emergence of large language models (LLMs) has revolutionized the paradigm of information retrieval (IR) applications, especially in web search. With their remarkable capabilities in generating human-like texts, LLMs have created enormous texts on the Internet. As a result, IR systems in the LLMs era are facing a new challenge: the indexed documents now are not only written by human beings but also automatically generated by the LLMs. How these LLM-generated documents influence the IR systems is a pressing and still unexplored question. In this work, we conduct a quantitative evaluation of different IR models in scenarios where both human-written and LLM-generated texts are involved. Surprisingly, our findings indicate that …
    [P] Numpy / Numba implementation of IVF/PQ ANN index which is as fast as faiss
    Hi, just sharing my recent GitHub project fast-ivf, in which I implemented an IVF (+PQ) index from scratch using only the numpy/numba libraries. I did it mostly for educational purposes. I also implemented something I called the `CompressedFastIVF` index, which trains a shallow autoencoder using kmeans assignments to reduce the dimensionality of the source embeddings; this seems to be a nice alternative to the PQ method, at least on my data. Surprisingly, when I compared my implementation with the faiss library, I got a 10x speedup on my custom dataset of about 900k vectors of size 1024 when using a simple IVF index. There are probably multiple reasons for that, e.g. I use numpy with the mkl library for algebraic operations, a different kmeans implementation (which results in a different distribution of cluster sizes), etc. When I tested it on other publicly available datasets, the speedup was much smaller, which showed me that we should always test ANN libraries on the target data. submitted by /u/kmkolasinski [link] [comments]  ( 9 min )
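For readers unfamiliar with the IVF idea mentioned in this post, here is a minimal numpy sketch. It is not the fast-ivf code and not the faiss API; the function names (`build_ivf`, `search`) and parameters (`n_lists`, `n_probe`) are made up for illustration. It clusters the vectors with plain k-means, stores vector ids per cluster (the "inverted lists"), and at query time scans only the lists whose centroids are closest to the query.

```python
import numpy as np

def _sq_dists(a, b):
    # Squared Euclidean distances between rows of a and rows of b.
    return (a * a).sum(1)[:, None] - 2.0 * a @ b.T + (b * b).sum(1)[None, :]

def build_ivf(vectors, n_lists, n_iters=10, seed=0):
    """Cluster vectors with plain k-means and group vector ids by nearest centroid."""
    rng = np.random.default_rng(seed)
    centroids = vectors[rng.choice(len(vectors), n_lists, replace=False)].copy()
    for _ in range(n_iters):
        assign = np.argmin(_sq_dists(vectors, centroids), axis=1)
        for c in range(n_lists):
            members = vectors[assign == c]
            if len(members):  # keep the old centroid if a cluster is empty
                centroids[c] = members.mean(axis=0)
    assign = np.argmin(_sq_dists(vectors, centroids), axis=1)
    lists = [np.flatnonzero(assign == c) for c in range(n_lists)]
    return centroids, lists

def search(query, vectors, centroids, lists, k=5, n_probe=2):
    """Scan only the n_probe closest inverted lists, then rank candidates exactly."""
    probe = np.argsort(_sq_dists(query[None, :], centroids)[0])[:n_probe]
    cand = np.concatenate([lists[c] for c in probe])
    d = _sq_dists(query[None, :], vectors[cand])[0]
    order = np.argsort(d)[:k]
    return cand[order], d[order]
```

With `n_probe` equal to `n_lists` the search degenerates to brute force; smaller values trade recall for speed, which is one reason results can vary so much between datasets.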
    [D] Machine Learning in Health
    I would like to know if someone currently works or has worked as a machine learning engineer in the field of medical science / health and if so, I would like to know about their experiences. The background is that I got the possibility to either work in the medical field or robotics and I can't really decide and thus looking for some input. I am most curious about what you did in your work and if it felt fun / rewarding. Thanks a lot! submitted by /u/Numerous_Talk7940 [link] [comments]  ( 9 min )
    [D] [P] ML Forecasting of Satellite Imagery and Handling Missing Data (Clouds)?
    Currently working on a personal project using satellite imagery to identify and forecast algae on the ocean's surface. The general plan is to use an algorithm to identify times/locations of algae on the surface and an ML model to generate a forecast. I have the algae identification method down (not ML); its output is a binary raster of where the algae is or is not on a given day/period, which would be my target data. I also have plenty of possible predictor data (also rasters) to use in a model. In the past, the only ML I've done with raster data is supervised classification with images and their labels. This means I am having trouble even finding a starting point for this part of my project. Another complication I expect is that in both my target and predictor data, there are areas of missing values, where clouds were in the way of the sensor. I am hoping I can find a model that can handle missing values to avoid imputation. I am quite new to the ML space, and have very little experience using raster data in ML. Just hoping for some ideas or places to start reading up on possible methods dealing with similar issues. Any help is appreciated. submitted by /u/DisgustedApe [link] [comments]  ( 9 min )
    [D] [P] Bitnet in Pytorch or Jax
    I was interested by the new bitnet paper https://arxiv.org/pdf/2310.11453.pdf, and was wondering if there was any way to use the 1 bit (1 or -1) in actual practice and how? More specifically, I know that you can do this with cuda (which I don't have any experience with) but it would be much better if there was a way to do this on a TPU (Jax?). Any implementation I've seen so far just pretends like they are using 1 bit but representing it with higher precisions. submitted by /u/Additional-Ad-7043 [link] [comments]  ( 9 min )
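On the storage side of this question, a small numpy sketch of what "really using 1 bit" could look like: the signs are packed eight per byte with `np.packbits` and unpacked on the fly for the matrix-vector product. This is not the BitNet implementation, and the helper names are hypothetical. It only shows the memory layout; the arithmetic after unpacking is still ordinary full-precision, so any speedup would need dedicated kernels.

```python
import numpy as np

def pack_signs(w):
    """Pack a {-1, +1} weight matrix into uint8, 8 weights per byte (row-wise)."""
    bits = (w > 0).astype(np.uint8)          # +1 -> 1, -1 -> 0
    return np.packbits(bits, axis=1), w.shape[1]

def unpack_signs(packed, n_cols):
    """Recover the {-1, +1} matrix from its packed form."""
    bits = np.unpackbits(packed, axis=1, count=n_cols)
    return bits.astype(np.int8) * 2 - 1

def binary_matvec(packed, n_cols, x):
    """y = W @ x with W stored at 1 bit per weight; W is unpacked on the fly."""
    return unpack_signs(packed, n_cols).astype(x.dtype) @ x
```

A 4 x 20 sign matrix packs into 4 x 3 bytes here, since each row is padded to a whole number of bytes.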
    [D] OmniAI - ETL for AI applications
    Hey r/MachineLearning, I'm one of the founders of OmniAI. We just got accepted into YC's W24 batch, and we’re super excited about simplifying AI data workflows. OmniAI is a data infrastructure layer for AI. We’re syncing a company's data into a central warehouse that's optimized for AI interactions (vectorized, graph relations, etc.). Models can be run directly on that warehouse and kept up to date with your business intelligence. We'd love to hear about any pain points you are having with vector databases. Any insight/feedback is very much appreciated 😊 submitted by /u/travelingladybug23 [link] [comments]  ( 9 min )
  • Open

    UK, US, EU and China sign declaration of AI's 'catastrophic' danger
    The UK, US, EU, and China have signed the Bletchley declaration, acknowledging the potential catastrophic risks posed by artificial intelligence (AI). The declaration does not establish an international testing hub in the UK but sets a template for future collaboration. The signatories recognize the potential for serious harm from AI models and agree on the urgency of understanding the risks. The UK Prime Minister and the UK Technology Secretary welcomed the declaration, emphasizing the need for collective action in addressing the risks of frontier AI. The declaration marks a diplomatic success for the UK, which hosted the AI safety summit. There is little international agreement on global AI regulations or who should develop them. The US announced the creation of a separate American AI Safety Institute, while the EU is in the process of passing an AI bill. The UK government plans to properly understand the problem before applying solutions and denies falling behind international counterparts. Source : https://www.theguardian.com/technology/2023/nov/01/uk-us-eu-and-china-sign-declaration-of-ais-catastrophic-danger submitted by /u/NuseAI [link] [comments]  ( 9 min )
    MASTER CHIEF vs BLACK PANTHER | AI Multi-VS
    This project uses AI such as Chat GPT-4, Eleven Labs, D-ID, & Midjourney to simulate a Virtual AI Co-Host of Cortana from the Halo Franchise. Cortana is fully voiced, modeled, & lip-synced to simulate an actual artificial intelligence evaluation of Duels between characters in all Media Universes. submitted by /u/AcanthisittaCheap914 [link] [comments]  ( 9 min )
    What data/dataset would you love to have for your AI project?
    As the title says, what dataset are you looking for but find difficult to acquire for your AI/ML project/business? Also explain what you're trying to build and how it can be useful! submitted by /u/nobilis_rex_ [link] [comments]  ( 9 min )
    Need Help About Height Map Analysis
    As mentioned in the title, I need help with height map analysis. I want to create an artificial intelligence that can analyze a height map given as input and mark areas such as mountainous areas, river beds, plains, rocks, cliffs and other geographical details on the map. Is there any advice you can give me, or could you provide suggestions that will help me move forward? I am really interested in this project and want to work on it. Also, if you know any examples and/or studies related to this subject, can you please share them? I am looking forward to discovering more information regarding this topic. submitted by /u/PlayerWell [link] [comments]  ( 9 min )
    Microsoft starts selling AI tool for Office, which could generate $10B/y by 2026
    Microsoft has started selling its artificial intelligence tool, Copilot, as an add-on to Office productivity software subscriptions. The tool appears in Word, Excel, and other Office programs and is priced at $30 per person per month. Piper Sandler analysts estimate that Copilot could generate over $10 billion in annualized revenue by 2026. Microsoft aims to leverage its dominant position in the productivity software market, while Google is selling its own AI enhancement for Workspace tools. Piper Sandler's model assumes that 18% of eligible users will use Copilot, driven by a fear of missing out (FOMO) element. Companies without Copilot may be at a disadvantage in competitive industries. Microsoft CEO Satya Nadella stated that customers who use Copilot find it indispensable. Microsoft has initially targeted the largest companies for Copilot adoption, with 40% of Fortune 100 companies already using it in an invitation-only paid early-access program. While there is limited data on Copilot's performance, organizations are encouraged to experiment with generative AI, which can create synthetic images and text with minimal human input. Microsoft faces the challenge of expanding Copilot adoption beyond a small core of end users to achieve wide deployment. Analysts suggest that Copilot could be distributed to highly paid executives to help prioritize email messages and understand documents, but caution that technically savvy employees familiar with generative AI may be better suited for early adoption. Microsoft may also benefit from companies using additional Azure cloud services, such as Purview for data management, during the setup of Copilot. Source : https://www.cnbc.com/2023/11/01/microsoft-365-copilot-becomes-generally-available.html submitted by /u/NuseAI [link] [comments]  ( 9 min )
    17 AI tools for Marketing
    submitted by /u/Senior_tasteey [link] [comments]  ( 8 min )
    Is there a way to combine the ChatGPT Retrieval Plugin + Nougat: Neural Optical Understanding for Academic Documents + a vector database?
    Please tell me how I can correctly transfer all my books, textbooks, and documents in PDF format to a vector database while preserving the layout structure and equations. Has anyone implemented this idea using Nougat: Neural Optical Understanding for Academic Documents (https://facebookresearch.github.io/nougat/)? If so, please say a few words about how you did it. One more question: how exactly does the ChatGPT Retrieval Plugin help you when solving problems? Can it be used to extract information from your vector database during a ChatGPT dialog? Thanks in advance for the answers. submitted by /u/Imunoglobulin [link] [comments]  ( 9 min )
    AI Tools Blur the Line Between Marketing Strategy and Tactics for Startups
    AI tools have become popular in marketing strategies for startups, allowing for faster execution and content creation. However, there is a concern that these tools may blur the line between strategy and tactics, leading to a lack of success stories from startups. Startups should have a solid strategy in place before relying solely on AI tools for marketing. AI tools can be valuable for solopreneurs and provide guidance and assistance in bouncing ideas off. Startups should view AI tools as a compass rather than a magic wand, guiding them along the way to their strategic goals. Using ChatGPT as an example, providing custom instructions and asking questions can lead to better marketing decisions. Source : https://www.erwanderlyn.com/p/chat-gpt-for-marketing-strategy submitted by /u/NuseAI [link] [comments]  ( 9 min )
    Silly, I had the Terminator from Terminator 2 play Zork as the Terminator. I'm still learning.
    submitted by /u/notlikelyevil [link] [comments]  ( 8 min )
    Analysis of AI Risk Discourse - 'AI Risk: An Illusion of the Future?'
    submitted by /u/LaVolpe223 [link] [comments]  ( 8 min )
    Dilemma
    submitted by /u/Sea_Permit5660 [link] [comments]  ( 8 min )
    One-Minute Daily AI News 10/31/2023
    The contribution of generative artificial intelligence to global gross domestic product is now expected to be higher within the next 10 years as adoption of the emerging technology is expected to grow, Goldman Sachs has said.[1] NVIDIA has unveiled a custom large language model, the technology on which artificial intelligence tools like ChatGPT are based, which the company has developed for its internal use. Trained on NVIDIA’s proprietary data, “ChipNeMo” will generate and optimize software and provide assistance to human designers in building semiconductors.[2] IBM Launches Generative AI Coding Assistant “watsonx” for Mainframe Modernization.[3] The F.D.A. has approved many new programs that use artificial intelligence, but doctors are skeptical that the tools really improve care or are backed by solid research.[4] Sources: [1] https://www.thenationalnews.com/business/technology/2023/10/31/generative-ais-economic-contribution-likely-to-rise-goldman-sachs-says/ [2] https://research.nvidia.com/publication/2023-10_chipnemo-domain-adapted-llms-chip-design [3] https://winbuzzer.com/2023/10/30/ibm-launches-generative-ai-coding-assistant-watsonx-for-mainframe-modernization-xcxwbn/ [4] https://www.nytimes.com/2023/10/30/health/doctors-ai-technology-health-care.html submitted by /u/Excellent-Target-847 [link] [comments]  ( 9 min )
  • Open

    MetNet-3: A state-of-the-art neural weather model available in Google products
    Posted by Samier Merchant, Google Research, and Nal Kalchbrenner, Google DeepMind Forecasting weather variables such as precipitation, temperature, and wind is key to numerous aspects of society, from daily planning and transportation to energy production. As we continue to see more extreme weather events such as floods, droughts, and heat waves, accurate forecasts can be essential to preparing for and mitigating their effects. The first 24 hours into the future are especially important as they are both highly predictable and actionable, which can help people make informed decisions in a timely manner and stay safe. Today we present a new weather model called MetNet-3, developed by Google Research and Google DeepMind. Building on the earlier MetNet and MetNet-2 models, MetNet-3 p…  ( 94 min )
  • Open

    Variance reduction technique proof?
    submitted by /u/Massive_Cup_4458 [link] [comments]  ( 9 min )
    Newbie here - trying to make crude image stabilization using RL
    Hi there! I recently dived into the topic of RL after finishing a neural networks course at my university. I have a basic understanding of the underlying principles of CNNs, the general idea of what Reinforcement Learning is, and I'm trying to learn more by making a project. The problem that I came up with is as follows: The system receives consecutive images frame after frame (let's assume the frames come from a prerecorded video of a stationary object, but the cameraman's hand is shaking/moving slowly) and tries to compute offsets for them that allow the user to align new frames to the original one. My idea is to use RL to train a network to recognize how the current frame is offset from the original (initial, first frame fed to the network) to allow some other software or even the us…  ( 10 min )
    Invalid Action Masking when action space is continuous
    Background: I am relatively new to RL, so apologies if this post comes off as repetitive or the solution is immediately obvious. But to my knowledge, I couldn't find any examples online for my use case, hence posting here. The problem I'm trying to solve is a relatively simple one. Let's say that my action space has only two variables x1 & x2, both continuous (Box). I want my valid actions to be only those where x1 + x2 < k where k is some constant. So I decided to use invalid action masking because that seems to be the most efficient solution. But all I see online are examples of invalid action masking when actions are categorical. Even the ray discussion forums say that action masking for continuous action spaces isn't something they have done. Has anyone done this before? Any resources that you can point towards? submitted by /u/Solitary_Walker [link] [comments]
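One common workaround for a constraint like x1 + x2 < k is not masking at all but mapping whatever the policy outputs into the feasible region before the environment sees it. The sketch below is only one such option, under the assumption (not stated in the post) that both action components are non-negative; `to_feasible` and `margin` are hypothetical names.

```python
import numpy as np

def to_feasible(raw_action, k, margin=1e-6):
    """Map a non-negative raw action (x1, x2) to one with x1 + x2 < k.

    Feasible actions pass through unchanged; infeasible ones are shrunk
    towards the origin until they sit just inside the constraint.
    Assumes all components are >= 0 and k > margin.
    """
    a = np.asarray(raw_action, dtype=np.float64)
    total = a.sum()
    limit = k - margin
    if total <= limit:
        return a
    return a * (limit / total)
```

Such a function could be applied inside the environment's step or in an action wrapper around it. The agent then learns with the mapping as part of the environment dynamics, which is simple but means many raw actions collapse onto the constraint boundary.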
  • Open

    Dialogue-guided visual language processing with Amazon SageMaker JumpStart
    Visual language processing (VLP) is at the forefront of generative AI, driving advancements in multimodal learning that encompasses language intelligence, vision understanding, and processing. Combined with large language models (LLM) and Contrastive Language-Image Pre-Training (CLIP) trained with a large quantity of multimodality data, visual language models (VLMs) are particularly adept at tasks like image captioning, […]  ( 16 min )
    How Reveal’s Logikcull used Amazon Comprehend to detect and redact PII from legal documents at scale
    Today, personally identifiable information (PII) is everywhere. PII is in emails, Slack messages, videos, PDFs, and so on. It refers to any data or information that can be used to identify a specific individual. PII is sensitive in nature and includes various types of personal data, such as name, contact information, identification numbers, financial information, […]  ( 8 min )
  • Open

    Using classical statistics to avoid regulatory burden
    On June 29 this year I said on Twitter that companies would start avoiding AI to avoid regulation. I followed that up with an article Three advantages of non-AI models. The third advantage I listed was Statistical models are not subject to legislation hastily written in response to recent improvements in AI. The chances that […] Using classical statistics to avoid regulatory burden first appeared on John D. Cook.  ( 5 min )
    Executive order on differential privacy
    This week President Biden signed a long, technically detailed executive order that among other things requires the Secretary of Commerce to look into differential privacy. Within 365 days of the date of this order … the Secretary of Commerce … shall create guidelines for agencies to evaluate the efficacy of differential-privacy-guarantee protections, including for AI. […] Executive order on differential privacy first appeared on John D. Cook.  ( 6 min )
    Differential entropy and privacy
    Differential entropy is the continuous analog of Shannon entropy. Given a random variable X with density function $f_X$, the differential entropy of X, denoted h(X), is defined as $h(X) = -\int f_X(x) \log f_X(x) \, dx$ where the integration is over the support of $f_X$. You may see differential entropy defined using logarithm to a different base, which changes h(X) by a constant […] Differential entropy and privacy first appeared on John D. Cook.  ( 5 min )
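As a quick worked example (not taken from the linked post), take X uniform on the interval [0, a], so that the density is 1/a on its support. Integrating the negative of density times log-density over the support gives

$$h(X) = -\int_0^a \frac{1}{a} \log \frac{1}{a} \, dx = \log a .$$

For a < 1 this is negative, which shows one way differential entropy differs from Shannon entropy: the latter is never negative for a discrete variable.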
  • Open

    Turing’s Mill: AI Supercomputer Revs UK’s Economic Engine
    The home of the first industrial revolution just made a massive investment in the next one. The U.K. government has announced it will spend £225 million ($273 million) to build one of the world’s fastest AI supercomputers. Called Isambard-AI, it’s the latest in a series of systems named after a legendary 19th century British engineer Read article >  ( 6 min )
    Unlocking the Power of Language: NVIDIA’s Annamalai Chockalingam on the Rise of LLMs
    Generative AI and large language models are stirring change across industries — but according to NVIDIA Senior Product Manager of Developer Marketing Annamalai Chockalingam, “we’re still in the early innings.”  In the latest episode of NVIDIA’s AI Podcast, host Noah Kravitz spoke with Chockalingam about LLMs: what they are, their current state and their future Read article >  ( 5 min )
  • Open

    Exclusive Invitation: Join My Talk on AI-Bots This Morning!
    Hey Friend,  ( 7 min )
  • Open

    TFGNN: Tensorflow GNN
    Does anyone have any experience with using the tensorflow/gnn library for training and testing neural networks on graph data? GitHub: tensorflow/gnn submitted by /u/Choice-Secret-99 [link] [comments]
    Recommended object tracking (deep) models
    Hi, I already have a pretty good net that detects the object I look for in a single frame. What architectures are there for turning all these single-frame predictions into an object tracking algorithm? submitted by /u/jonathan923_ [link] [comments]

  • Open

    TMLR Paper “Conformal Prediction under Ambiguous Ground Truth”
    Conformal prediction uses a held-out, labeled set of examples to calibrate a classifier to yield confidence sets that include the true label with user-specified probability. But what happens if even experts disagree on the ground truth labels? Commonly, this is resolved by taking the majority-voted label from multiple experts. However, in difficult and ambiguous tasks, the majority-voted label might be misleading and a bad representation of the underlying true posterior distribution. In this paper, we introduce Monte Carlo conformal prediction, which allows performing conformal calibration directly against expert opinions or aggregate statistics thereof. The post TMLR Paper “Conformal Prediction under Ambiguous Ground Truth” appeared first on David Stutz.  ( 4 min )
  • Open

    Macbook Pro M3 for LLMs and Pytorch? [D]
    My current PC laptop is soon ready to retire, having worked for seven years. As a replacement I'm considering the new Macbook Pros. It is mainly the battery time which makes me consider Apple. These are my requirements for the laptop: great battery time; 16" since I'm old and my eyes are degraded; dual external monitors; software engineering including running some local docker images. Then I have two ML requirements which I don't know if I could fulfill using a laptop: good performance for working with local LLMs (30B and maybe larger); good performance for ML stuff like Pytorch, stable baselines and sklearn. In order to fulfill the MUST items I think the following variant would meet the requirements: Apple M3 Pro chip with 12‑core CPU, 18‑core GPU, 16‑core Neural Engine; 36 GB memory; 512 GB SSD; price: $2899. Question: Do you think I could fulfill the ML requirements using a Macbook Pro M3? Which config would be smart to buy in such case? Thankful for advice! submitted by /u/nizego [link] [comments]  ( 9 min )
    [D] Is Nvidia's new "System Memory Fallback for Stable Diffusion" also compatible with model training / inference in general?
    Hi all, today Nvidia released a new driver version that appears to allow the GPU to use system memory instead of crashing when it runs out, seen here: https://nvidia.custhelp.com/app/answers/detail/a_id/5490/~/system-memory-fallback-for-stable-diffusion I was wondering if this is compatible with model training and inference in Tensorflow and/or PyTorch, and how I could enable that (or if it would just work by default). This is especially confusing to me as I run Tensorflow in WSL so I don't know if this setting would carry over. submitted by /u/joshglen [link] [comments]  ( 9 min )
    [D] Can Diffusion Models really be Steered? Not really! Not even ControlNet, ICCV23 Best Paper
    While diffusion models (e.g. Stable Diffusion) are all the rage, they don't seem to be prepared for downstream tasks. ControlNet looks great (on paper), but open-source implementations for mere mortals aren't ready for prime time. Do you have examples that show the contrary? Will FAANGs and not-really-open Research Labs be the only ones capable of making it happen? submitted by /u/btcmx [link] [comments]  ( 9 min )
    [R] - An Open-Sourced Data Contamination Report for Llama Series Models
    Data Contamination in Multi-choice QA Benchmarks: How many test samples are included in Llama's training data? This report presents how many test samples from popular multi-choice QA benchmarks are included in the training data of Llama models (Common Crawl 2017–2020). Three types of data contamination are distinguished: input-only contamination, input-and-label contamination, and all contamination, which covers both. Input-only contamination means that only the input part of a test sample was included in the training data. In contrast, input-and-label contamination means that both the input and the answer were included in the training data. Impact on Model Performance: How much does data contamination affect model evaluation? The full open-sourced data contamination report: https://arxiv.org/abs/2310.17589 All data and code: https://github.com/liyucheng09/Contamination_Detector submitted by /u/Simple-Leopard-7646 [link] [comments]  ( 9 min )
    [R] Small Language Models Fine-tuned to Coordinate Larger Language Models improve Complex Reasoning
    Paper link: http://arxiv.org/abs/2310.18338 Description: We introduce DaSLaM, which uses a decomposition generator to decompose complex problems into subproblems that require fewer reasoning steps. These subproblems are answered by a solver. We use a relatively small (13B parameters) LM as the decomposition generator, which we train using policy gradient optimization to interact with a solver LM (regarded as black-box) and guide it through subproblems, thereby rendering our method solver-agnostic. Evaluation on multiple different reasoning datasets reveals that with our method, a 175 billion parameter LM (text-davinci-003) can produce competitive or even better performance, compared to its orders-of-magnitude larger successor, GPT-4. Additionally, we show that DaSLaM is not limited by the solver's capabilities as a function of scale; e.g., solver LMs with diverse sizes give significant performance improvement with our solver-agnostic decomposition technique. submitted by /u/Gaussian_Kernel [link] [comments]  ( 9 min )
    [P] Open-source modular observability for AI Systems
    Hi r/MachineLearning community! Over the last three years, we have collaborated with hundreds of teams to enhance our understanding of observability requirements in AI systems. ML teams are trying to log as much as they can at every possible dimension. To satisfy these needs, they must be able to: - Log anything from every part of the AI Infra, - Observe and interpret the logged data at scale in a flexible fashion, - Add layers and layers of automations around the logged data. Thrilled to share with you the new product we built - AimOS, a framework to connect the dots and ensure modular observability for AI Systems! Easily log, connect, and observe any part of your AI Systems – from experimentation and production stages to input prompts and monitoring. AI Systems are not determ…  ( 10 min )
    [D] Do you calculate the accuracy and loss of a neural network per batch or over the whole dataset?
    Hello, I'm curious: when evaluating a neural network on both the training and validation data, do you calculate the accuracy across the entire dataset, or per batch and then average across batches? submitted by /u/CrunchyMind [link] [comments]  ( 9 min )
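The two options in this question only give different numbers when batches have unequal sizes, typically because the last batch is smaller. A small sketch (the function name is made up for illustration) that computes both:

```python
import numpy as np

def accuracy_two_ways(y_true, y_pred, batch_size):
    """Return (mean of per-batch accuracies, accuracy over the whole dataset)."""
    correct = (np.asarray(y_true) == np.asarray(y_pred))
    batches = [correct[i:i + batch_size] for i in range(0, len(correct), batch_size)]
    mean_of_batches = float(np.mean([b.mean() for b in batches]))
    whole_dataset = float(correct.sum() / len(correct))
    return mean_of_batches, whole_dataset
```

With 10 samples, 8 of them correct, and a batch size of 4 where both errors fall in the final batch of 2, the per-batch average is 2/3 while the dataset accuracy is 0.8. Weighting each batch by its size, or accumulating correct counts and dividing once at the end, removes the discrepancy.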
    [D] Unsupervised Clustering without knowing number of classes
    Does anyone know where to find the best models for unsupervised clustering problems that don't specify the number of classes? For example, I googled unsupervised MNIST, but IIC, which holds the record, requires the output dimension (k=10) to be specified. Is there a name for unsupervised clustering without knowing the number of classes? (I know of density/hierarchical clustering algorithms but am unaware of many deep learning ones.) And specifically, are results charted anywhere? I'm researching the topic and it seems knowing the number of things you're looking for is half the battle. I can find papers on methods that aim to find the number of clusters etc., but are there any benchmarks to compare? submitted by /u/BigBrainUrinal [link] [comments]  ( 9 min )
    [R] Announcing Distil-Whisper - 6x faster than Whisper-large-v2 and performs within 1% WER on out-of-distribution
    Hey r/MachineLearning! At Hugging Face, we've worked hard over the last months to create a powerful but fast distilled version of Whisper. We're excited to share our work with you now! Distil-Whisper is 6x faster than Whisper-large-v2 and performs within 1% WER on out-of-distribution datasets. On long-form audio, we even achieve better results thanks to a reduction in hallucinations. For more information, please have a look: - GitHub page: https://github.com/huggingface/distil-whisper/tree/main - Paper: https://github.com/huggingface/distil-whisper/blob/main/Distil_Whisper.pdf Quick summary: Distillation Process: We've kept the whole encoder, but reduced the decoder to just 2 layers. Encoding takes O(1) forward passes, decoding takes O(N). To improve speed, all that matters is the decoder! The encoder is frozen during distillation while we fine-tune all of the decoder. Both KL loss and pseudo-labeling next word prediction are used. Data: We use 20,000h of open-sourced audio data coming from 9 diverse audio datasets. A WER-filter is used to make sure low-quality training data is thrown out. Results: We've evaluated the model only on out-of-distribution datasets and are only 1% worse than Whisper-large-v2 on short-form evals (CHiME-4, Earnings-22, FLEURS, SPGISpeech). On long-form evals (Earnings, Meanwhile, Rev 16) we beat Whisper-large-v2 thanks to a reduction in hallucinations. Robust to noise: Distil-Whisper is very robust to noise (similar to its teacher). We credit this to keeping the original encoder frozen during training. Pushing for max inference time: Distil-Whisper is 6x faster than Whisper on both short-form and long-form audio. In addition, we employ Flash Attention and chunked decoding which helps us achieve a real-time factor of 0.01! Checkpoints?! Checkpoints will be released this Thursday and will be directly integrated into Transformers. All checkpoints will be licensed under MIT. submitted by /u/pvp239 [link] [comments]  ( 9 min )
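The post mentions training with both a KL loss and pseudo-label next-word prediction. A generic numpy sketch of such a combined objective is shown below; it is not the Distil-Whisper training code, and the weights `w_kl` and `w_pl` as well as the use of the teacher's argmax tokens as pseudo-labels in the usage example are assumptions for illustration.

```python
import numpy as np

def log_softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)   # stabilise the exponent
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def distillation_loss(student_logits, teacher_logits, pseudo_labels,
                      w_kl=1.0, w_pl=1.0):
    """Weighted sum of KL(teacher || student) and cross-entropy on pseudo-labels.

    Logits have shape (tokens, vocab); pseudo_labels holds one token id per row.
    Returns (total, kl, ce).
    """
    log_p_s = log_softmax(student_logits)
    log_p_t = log_softmax(teacher_logits)
    kl = (np.exp(log_p_t) * (log_p_t - log_p_s)).sum(axis=-1).mean()
    ce = -log_p_s[np.arange(len(pseudo_labels)), pseudo_labels].mean()
    return w_kl * kl + w_pl * ce, kl, ce
```

The KL term pushes the student's whole next-token distribution towards the teacher's, while the cross-entropy term trains it on the tokens the teacher actually produced.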
    [D] model for "understanding" dashcam images
    i have some dashcam footage from my car and want to see how a model could embed the images in an unsupervised (or self-supervised) way so i don't have to label everything. like if scenarios that are semantically similar, but different in pixel space (pulling out the driveway in the day versus pulling in at night) could be clustered close-ish together in latent space so that i could label fewer images and have the model get the others using something like k-nearest or whatever. i am starting off with just frame level before i try to tackle videos as a sequence of images (will probably lose interest by that point, so want to get images dealt with first). i looked into VAEs and tried training one from scratch on my data but i don't have enough compute power for that. does anyone here have any ideas about this? any pretrained off the shelf models that i could use for this? any leads for a literature survey? submitted by /u/samrus [link] [comments]  ( 9 min )
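The "label a few frames, let k-nearest handle the rest" step described here can be sketched independently of which encoder produces the embeddings. The snippet below assumes the embeddings already exist as a numpy array (for example from some pretrained image encoder, which is not specified in the post) and that labels are small non-negative integers; `knn_propagate` is a hypothetical helper name.

```python
import numpy as np

def knn_propagate(embeddings, labeled_idx, labels, k=3):
    """Give every frame the majority label of its k nearest labeled frames (cosine)."""
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = x @ x[labeled_idx].T                    # (n_frames, n_labeled)
    nearest = np.argsort(-sims, axis=1)[:, :k]     # positions within labeled_idx
    neighbour_labels = np.asarray(labels)[nearest]
    return np.array([np.bincount(row).argmax() for row in neighbour_labels])
```

How well this works depends entirely on whether the embedding space puts semantically similar scenes (day versus night versions of the same manoeuvre) close together, so it is worth checking a handful of propagated labels by eye.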
    [D] - People here who mastered out of their PhDs, do you regret it? How has your life been after that?
    Hi everyone. I'm a third-year Ph.D. student doing ML-related research and have no publications so far. I do have a couple of ongoing projects that will lead to first-author papers in the next few months. I'm also at a point where I think mastering out and getting a job might be a better option. But I also worry that I might regret not getting a Ph.D. I love research but I feel the academic environment is not a good fit for me. I just want to hear from people who were in a similar position to mine. Did you stick through your PhD or did you master out? How has life been after that? I started my PhD straight out of my undergrad so I didn't get any industry experience. So I've been thinking that getting a job with my master's and spending some time in the industry could be a good option. Then I can return to grad school if I still have that urge. Or I can simply brave through my current situation and just get a PhD and then work in the industry. Any opinions are welcome! submitted by /u/llmlift [link] [comments]  ( 9 min )
    [P] Web-based generative AI content tools that require no sign-up to be used?
    I need some generative AI tools for a state-sponsored 1 hour course a friend of mine is hosting for people aged from 12 to 60 years old. They are taking an "induction" into AI with the goal of learning how to produce content with the help of AI. The idea is that they have a range of web-based AI tools at their disposal on a computer in order to do a "collaborative creative film" of sorts. The different people in the course will be grouped into pairs, and then they will choose some of the AI tools in a list to help them with the creative process. The AI tools in the list should not require sign-up. The no-sign-up rule is a requirement because of data protection laws (no personal e-mail or other personal info should be given). Also, it would be preferable that, if the AI requires text inpu…  ( 10 min )
    [D] ctransformers vs llama-cpp-python
    what's the difference between ctransformers and llama-cpp-python? submitted by /u/kaoutar- [link] [comments]  ( 8 min )
    [R]What can cause the silhouette score to increase like this at the end? Value of K=10 for about 200 dataset seems too much, no?
    ​ https://preview.redd.it/gkibii705kxb1.png?width=504&format=png&auto=webp&s=ae2820ab0d324224177fa971319cd82965eb0129 submitted by /u/Gandalfthebrown7 [link] [comments]  ( 8 min )
    [D] Using C++ for training a Vision Transformer Agent
    Hey! So, I am currently enrolled in a Master's Degree program and started my thesis this semester. I still have a year to develop it, so what I'm doing now is gathering as much info as I can about libraries for ML. My adviser and I have decided that we're going to use Vision Transformers as our approach to train an agent to inspect products at factories. So, a little background on me: I am a self-taught game developer. I've learned Lua, C, C++, and C# to make games, that's what I'm good at. I've studied the basics of ML at uni, but it was on Python. My most proficient language is C++, as I've worked a lot with it and feel comfortable with it, so I was thinking: Are there any good ML libraries for C++? Libraries that are as easy to use as Python libs (TensorFlow, Pytorch, etc.), for example? I love the way that you have control over the resources when coding with C and/or C++. Runtime speed is a bonus too. Thanks for the help :) submitted by /u/retroJRPG_fan [link] [comments]  ( 9 min )
    [P] Visualizing Geopolitics with UN Voting data and Embeddings
    Hi all, I'm upskilling myself in Machine Learning. As a learning project, I trained an embedding model to group countries in the UN based on their voting habits. I thought the results were interesting, so I wanted to share. Medium post here: https://medium.com/@sambhattacharyya/visualizing-geopolitics-with-un-data-and-machine-learning-b93f84270900 View the interactive 3D graph of embeddings here: https://sambhattacharyya.com/visualizing-geopolitics/index.html I'm still learning, so I'd welcome any feedback on the technical approach (I'm especially sure that using MSE loss isn't ideal here). Colab notebook here: https://colab.research.google.com/drive/1YkM_AsHCcs_boOCqobyQAs38Q43cm1Th#scrollTo=6801c41a-03d2-447f-8929-15d4f399df0f Raw dataset I compiled here: https://huggingface.co/datasets/sam-bha/un-general-assembly-votes-2000-2023 submitted by /u/sam_bha [link] [comments]  ( 9 min )
    [D] has anyone experimented with neural search for e-commerce applications?
    Hi. I’m an AI engineer of an emerging retailer. We’re continuously pushing the boundaries of our user search experience. We’ve got a massive inventory, hence a lot of data to be managed. This got me thinking about the untapped potential of neural search. I've had my hands on OpenAI's GPT and Deepset's Haystack lately. Both tools are great in specific scenarios, but integrating them seamlessly at an enterprise scale is challenging, especially when we're talking about real-time user interactions. The primary challenge remains in managing multimodal data efficiently without sacrificing speed. To add context, my goal in leveraging something like GPT for e-commerce is to create a more intuitive, conversational, and responsive search function. Imagine a user typing in a vague description or query, and the system providing product suggestions like a seasoned salesperson would. Given the vast product range, the neural search could bridge the gap between user intent and the most relevant product offerings. If anyone has experience with this I’d like to hear your thoughts, and if you have any other tool recommendations for this pls do share. I’d be grateful for any help submitted by /u/PositiveFixing12 [link] [comments]  ( 9 min )
    [D] Is this close enough to be usable? Need your inputs: Automated RAG testing tool. AI Data Pipelines for Real-World Production (Part 3)
    Hey there, Redditors! I'm back with the latest installment on creating dependable AI data pipelines for real-world production. If you've been following along, you know I'm on a mission to move beyond the "thin OpenAI wrapper" trend and tackle the challenges of building robust data pipelines. With 18 months of hands-on experience and many user interviews, I realized that with the probabilistic nature of systems, we need better_testing.gpt: 1. As you build you should test The world of AI is a fast-moving one, and we've realized that just working on systems is not an optimal design choice. By the time your product ships, it might already be using outdated technology. So, what's the lesson here? Embrace change, test along, but be prepared to switch pace. 2. No Best Practices Yet for R…  ( 10 min )
    [P] A site where you can ask the same question to GPT-2, GPT-3, GPT-3.5 and GPT-4, and compare the outputs
    Hi /r/machinelearning! I've been working with my collaborators on a site where you can compare OpenAI models to get a sense of the improvement over time of the models: https://theaidigest.org/progress-and-dangers https://preview.redd.it/khruhgkp7jxb1.png?width=1960&format=png&auto=webp&s=21d13125145f7fae7351686d4078868d65cbf8c3 It includes a number of things that you might be interested in: You can ask any question and compare the outputs from the OpenAI models: https://preview.redd.it/s5e9acev8jxb1.png?width=1458&format=png&auto=webp&s=0c3e5ba3661fccfc4f4ba60db346b6142b1e52f3 Visualises OpenAI models benchmark performance across 22 benchmarks: https://preview.redd.it/vhai63308jxb1.png?width=1948&format=png&auto=webp&s=07f65f131b2e6d5122400120a11d24205b7d08d6 Shows examples of benchmark outputs for GPT-2 to GPT-4 https://preview.redd.it/f3p7ni068jxb1.png?width=1980&format=png&auto=webp&s=dfe25c8c4a486a0df3c4cce2e4497fd250163bd1 Discusses some dangerous emerging capabilities, such as biological weapons: https://preview.redd.it/n6hinz7b8jxb1.png?width=2002&format=png&auto=webp&s=70cf0a0c228e1ac194146040c23a7f41dfe4e09a Includes an example of a simple agent autonomously exploiting a vulnerability in a game's code: https://preview.redd.it/5a584w7f8jxb1.png?width=1944&format=png&auto=webp&s=3867a865c06b6e36fc2424f6ced038248ee0cafd I hope you'll find this a valuable resource for getting familiar with older LMs, comparing the outputs, and thinking about what's next in this space. Here's a link to the site: https://theaidigest.org/progress-and-dangers submitted by /u/timegentlemenplease_ [link] [comments]  ( 9 min )
    [D] Problems with WGAN for time series imputation
    I am trying to implement a WGAN for time series imputation. However, I am facing many problems because I am a novice in generative models and unfamiliar with many of the concepts behind WGANs. I know that WGANs have a problem with the critic's weights growing without bound, and there are three main ways to solve this problem, as discussed in this other post. However, implementing them with recurrent neural networks has been a total headache. 1- With weight clipping at a threshold of 0.1 (as suggested in the literature), my critic loss does not change at all. I think this may be due to vanishing gradients in recurrent networks. 2- Spectral normalization is not well supported in Torch for recurrent layers, as it only handles one weight per layer. 3- I cannot use gradient penalty because its implementation, as far as I know, depends on evaluating the critic on data interpolated between real and generated samples, which is not available to me as all samples contain a mixture of real and missing values. Is there a solution to this problem that I am missing? submitted by /u/SrPinko [link] [comments]  ( 9 min )
    [D] How to train a model that'll generate an "average" image based on a large set of images?
    There's a lot of image generators that'll allow you to generate multiple images based on the input data. Is there something that'll generate an image that's an average of the trained model instead? Didn't plan on using machine learning for this project but realized it might be an interesting path to explore. submitted by /u/max_b_jo [link] [comments]  ( 9 min )
    [D] Fast and reliable keyword extraction
    Please forgive my ignorance, as I am a software engineer and fairly new to data science and machine learning. Also, feel free to delete my post if this is not the right place to ask. I am currently working on a bookmark manager app that offers content preservation and automatic keyword extraction, among other features, to enrich these bookmarks and make them more discoverable. For the life of me, I can't find a reliable way to extract keywords. So far I have tried a Python library called Newspaper3k, but the results were a mixed bag: half of the time it knocks it out of the park with very accurate results, and the other half it just outputs garbage. I have switched to using the OpenAI GPT-3.5 APIs, but I really hate it. It's slow and very verbose, and it gives me a feeling of disgust because it's like using a machine gun to get rid of a fly. I have looked on Hugging Face and tried a couple of models, but no luck so far. Please help. I am happy to self-host something or just pay for a good API. submitted by /u/goodkernel [link] [comments]  ( 9 min )
  • Open

    How to use neural network to learn a Q table?
    I have studied the Q learning algorithm and applied it to the classic gridworld problem. I was able to use the update formula to generate the correct Q table. Now I have been assigned to generate the Q table using a neural network, rather than the update formula. However, I do not understand how a neural network could be used to learn a Q table. I would say that the input should be the state of the agent, and the output should be an action. But how do I know how many layers I should make? And how many nodes in each layer? And how do I optimize the weight? Any guidance would be immensely appreciated. submitted by /u/Parking_Antelope8865 [link] [comments]
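    One common way to set this up is sketched below, with several assumptions labeled: a 4x4 gridworld with 4 actions, a one-hot state encoding, and a single hidden layer of 8 tanh units (the sizes are arbitrary choices for illustration, not a rule). The network takes the state as input and outputs one Q-value per action rather than an action; the action is then chosen as the argmax. The weights are optimized by gradient descent on the squared difference between Q(s, a) and the usual target r + gamma * max Q(s', a'). The class name `TinyQNetwork` is hypothetical, and a real project would normally use a framework such as PyTorch instead of hand-written backpropagation.

```python
import math
import random

# Assumed problem size: 4x4 gridworld, 4 actions, 8 hidden units (all illustrative).
N_STATES, N_ACTIONS, N_HIDDEN = 16, 4, 8

class TinyQNetwork:
    """One-hidden-layer network: one-hot state in, one Q-value per action out."""

    def __init__(self, seed=0):
        rng = random.Random(seed)

        def init(rows, cols):
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

        self.w1 = init(N_HIDDEN, N_STATES)
        self.b1 = [0.0] * N_HIDDEN
        self.w2 = init(N_ACTIONS, N_HIDDEN)
        self.b2 = [0.0] * N_ACTIONS

    def hidden(self, state):
        # With a one-hot input, W1 @ x is simply column `state` of W1.
        return [math.tanh(self.w1[j][state] + self.b1[j]) for j in range(N_HIDDEN)]

    def q_values(self, state):
        h = self.hidden(state)
        return [
            sum(w * hj for w, hj in zip(self.w2[a], h)) + self.b2[a]
            for a in range(N_ACTIONS)
        ]

    def update(self, state, action, reward, next_state, done, gamma=0.9, lr=0.05):
        """One SGD step on 0.5 * (Q(s, a) - target)^2; returns the error before the step."""
        target = reward if done else reward + gamma * max(self.q_values(next_state))
        h = self.hidden(state)
        q = sum(w * hj for w, hj in zip(self.w2[action], h)) + self.b2[action]
        err = q - target
        for j in range(N_HIDDEN):
            # Backpropagate through tanh using the pre-update output weight.
            grad_h = err * self.w2[action][j] * (1.0 - h[j] ** 2)
            self.w2[action][j] -= lr * err * h[j]
            self.w1[j][state] -= lr * grad_h
            self.b1[j] -= lr * grad_h
        self.b2[action] -= lr * err
        return err
```

    In a training loop, the agent would pick actions epsilon-greedily from `q_values(state)`, call `update` on each observed transition, and read off the learned "Q table" at the end by evaluating `q_values` for every state.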
    CodeFusion: A Pre-trained Diffusion Model for Code Generation
    submitted by /u/nickb [link] [comments]
    RedPajama-Data-v2: an Open Dataset with 30 Trillion Tokens for Training Large Language Models — Together AI
    submitted by /u/nickb [link] [comments]
  • Open

    Two ChatGPTs Break the Silence: An Unmissable Verbal Showdown on AI Ethics!
    Hey guys I made a video earlier last night using two ChatGPT accounts with custom instructions running GPT4 on voice and had them have a debate over the ethics of AI, I thought it was pretty interesting and fun to do. I wonder what other fun things I can experiment and make the two of them do lol. https://www.youtube.com/watch?v=fFoyCiAwmfY submitted by /u/adamariefox [link] [comments]
    ChatGPT, let us create movies about ChatGPT.
    submitted by /u/Philipp [link] [comments]
    Artificial intelligence in sport — the key to success?
    submitted by /u/donutloop [link] [comments]
    Nvidia tests chatbots in chip design process in bid to use more AI
    Nvidia is testing chatbots in the chip design process to incorporate more AI. The company has used a large language model augmented with 30 years of chip design data to create chatbots that can answer questions from junior designers, saving senior designers time. The research also found that adding specific data from the company's experience can make a relatively modest chatbot more accurate than an advanced one. Nvidia demonstrated the use of AI to generate code, aiming to enhance engineers' productivity rather than replace them. Source : https://www.reuters.com/technology/nvidia-tests-chatbots-chip-design-process-bid-use-more-ai-2023-10-30/ submitted by /u/NuseAI [link] [comments]
    FULL LIST: Who is attending Britain's AI Safety Summit tomorrow?
    submitted by /u/TBP-LETFs [link] [comments]
    Happy Halloween! Choose your house
    submitted by /u/Sea_Permit5660 [link] [comments]
    Elon Musk to attend Rishi Sunak’s AI safety summit in Bletchley Park
    submitted by /u/nick9000 [link] [comments]
    Exclusive: G7 to agree AI code of conduct for companies
    submitted by /u/donutloop [link] [comments]
    Biden administration aims to cut AI risks with executive order
    submitted by /u/donutloop [link] [comments]
    One-Minute Daily AI News 10/30/2023
    India leads the way in global AI skill penetration, finds Stanford University’s AI Index Report.[1] Google Bard, the conversational AI tool by Google, can now respond to your questions in real-time. You can turn it off and tell Bard to only respond once the answer is complete, but now, by default, Bard writes out the response in real time.[2] Chinese technology giant Alibaba said on Tuesday it has updated its artificial intelligence (AI) model Tongyi Qianwen and released a suite of industry-specific AI models amid an intensifying AI race among tech companies.[3] Biden signs sweeping executive order regulating artificial intelligence.[4] Sources: [1] https://www.firstpost.com/tech/news-analysis/india-leads-in-ai-skills-and-github-ai-projects-says-stanfords-ai-index-report-13318412.html [2] https://searchengineland.com/google-bard-can-now-respond-in-real-time-433954 [3] https://finance.yahoo.com/news/1-alibaba-upgrades-ai-model-035625254.html [4] https://www.thedailynewsonline.com/news/biden-signs-sweeping-executive-order-regulating-artificial-intelligence/article_d461cda8-7737-11ee-8036-93d6b4aa3413.html submitted by /u/Excellent-Target-847 [link] [comments]
  • Open

    DSC Weekly 31 October 2023
    Announcements Top Stories In-Depth The post DSC Weekly 31 October 2023 appeared first on Data Science Central.  ( 20 min )
    Data engineering career guide
    From ensuring that mobile apps function smoothly to facilitating personalized recommendations and targeted ads, data engineering powers the digital experiences that have become part of many of our day-to-day lives. There is currently a major need for knowledgeable and skilled professionals to fill open data engineer roles. Do you have the skills and experience to… Read More »Data engineering career guide The post Data engineering career guide appeared first on Data Science Central.  ( 21 min )
    Approaches to creating virtual fitting room software using AR and AI
    Virtual fitting room software with AR and AI is the next best alternative to physical stores. With many different kinds of virtual fitting room solutions on offer though, it can be hard to know which ones are the most feasible for your business. Let’s talk about the various approaches to developing such solutions.  Types of… Read More »Approaches to creating virtual fitting room software using AR and AI The post Approaches to creating virtual fitting room software using AR and AI appeared first on Data Science Central.  ( 21 min )
    How hybrid AI can help LLMs become more trustworthy
    Back in 2015, Pedro Domingos of the University of Washington’s computer science department published The Master Algorithm. In the book, Domingos explored the possibility that one master algorithm could indeed rule them all. The main challenge, he said, was to bring the AI tribes together so that the strengths of the various approaches could be… Read More »How hybrid AI can help LLMs become more trustworthy The post How hybrid AI can help LLMs become more trustworthy appeared first on Data Science Central.  ( 21 min )
    AGI Jesse and the future of finance
    Introduction The financial world is on the cusp of a remarkable transformation, thanks to the integration of advanced AI models like GPT-4. In this article, we delve into the evolving landscape of Machine Learning (ML) in finance and explore the potential impact of these cutting-edge AI systems. The need for speed: Reacting to regime shifts… Read More »AGI Jesse and the future of finance The post AGI Jesse and the future of finance appeared first on Data Science Central.  ( 18 min )
    How AI chatbots are transforming the world?
    AI chatbot technology has taken the world by storm. From assisting individuals with their content requirements to facilitating top-notch customer service for businesses, AI chatbot technology offers it all.  The technology has penetrated multiple prominent industries. And rightly so. AI chatbots provide customer insights that help businesses strategize their marketing and sales plans, ultimately driving… Read More »How AI chatbots are transforming the world? The post How AI chatbots are transforming the world? appeared first on Data Science Central.  ( 22 min )
    How technical program managers can build a robust Generative AI future
    The modern digital ecosystem, buzzing with the chatter of data and algorithms, presents both promises and challenges. In this intricate web, generative artificial intelligence (GenAI) shines as a beacon of innovation. To harness this power, enterprises need more than just cutting-edge technology. They need a bridge between ambition and realization—a role aptly filled by… Read More »How technical program managers can build a robust Generative AI future The post How technical program managers can build a robust Generative AI future appeared first on Data Science Central.  ( 21 min )
    Generative AI ethics: Navigating the boundary between human and machine creativity
    Generative AI is revolutionizing our creative landscape, unlocking unprecedented possibilities. But at what cost? Dive into the ethical dilemmas of this transformative technology, exploring the fine line between innovation and ethical consideration.  2022 was a huge year for Generative AI. The release of DALL-E 2 in April showed the public the possibilities of text-to-image Gen… Read More »Generative AI ethics: Navigating the boundary between human and machine creativity The post Generative AI ethics: Navigating the boundary between human and machine creativity appeared first on Data Science Central.  ( 23 min )
  • Open

    Riding the Rays: Sunswift Racing Shines in World Solar Challenge Race
    In the world’s largest solar race car event of the year, the University of New South Wales Sunswift Racing team is having its day in the sun. The World Solar Challenge, which first began some 35 years ago, attracts academic participants from across the globe. This year’s event drew nearly 100 competitors. The race runs Read article >  ( 6 min )
    DLSS 3.5 With Ray Reconstruction Now Available in NVIDIA Omniverse
    The highly anticipated NVIDIA DLSS 3.5 update, including Ray Reconstruction for NVIDIA Omniverse — a platform for connecting and building custom 3D tools and apps — is now available.  ( 7 min )
  • Open

    Schneider Electric leverages Retrieval Augmented LLMs on SageMaker to ensure real-time updates in their ERP systems
    This post was co-written with Anthony Medeiros, Manager of Solutions Engineering and Architecture for North America Artificial Intelligence, and Blake Santschi, Business Intelligence Manager, from Schneider Electric. Additional Schneider Electric experts include Jesse Miller, Somik Chowdhury, Shaswat Babhulgaonkar, David Watkins, Mark Carlson and Barbara Sleczkowski.  Enterprise Resource Planning (ERP) systems are used by companies to […]  ( 10 min )
  • Open

    Control of spider like robots using GNN & RL
    hii this is my project but I have bare minimum coding knowledge and know even lesser about GNN and RL, has anyone worked on something like this before and would be able to dumb it down for me?? submitted by /u/sunshinebreakfast7 [link] [comments]
    RL & LLMS: An in-depth look at modern Game Playing AI Systems
    submitted by /u/AvvYaa [link] [comments]
    Why Gym/Gymnasium removed done from the step function
    submitted by /u/jkterry1 [link] [comments]
    "Sample-Efficient Reinforcement Learning by Breaking the Replay Ratio Barrier", D'Oro et al 2023
    submitted by /u/gwern [link] [comments]
  • Open

    The spookiest Halloween scenes
    Google Bard has the ability to describe images. But it turns out what you get depends a lot on how you ask. I gave Bard this image and the prompt "Please describe this spooky Halloween scene". On the right is the image I got when I took the  ( 6 min )
    Bonus: more spooky Halloween scenes
    AI Weirdness: the strange side of machine learning  ( 2 min )
  • Open

    Positive polynomials revisited
    The square of a real-valued polynomial is clearly non-negative, and so the sum of the squares of polynomials is non-negative. What about the converse? Is a non-negative polynomial the sum of the squares of polynomials? For polynomials in one variable, yes. For polynomials in several variables, no. However, Emil Artin proved nearly a century ago […] Positive polynomials revisited first appeared on John D. Cook.  ( 5 min )

  • Open

    Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence [N]
    https://www.whitehouse.gov/briefing-room/presidential-actions/2023/10/30/executive-order-on-the-safe-secure-and-trustworthy-development-and-use-of-artificial-intelligence/ It looks like content will have to be labeled, showing if it's AI-generated or not. And special rules will apply to: any model that was trained using a quantity of computing power greater than 10^26 integer or floating-point operations, or using primarily biological sequence data and using a quantity of computing power greater than 10^23 integer or floating-point operations; and any computing cluster that has a set of machines physically co-located in a single datacenter, transitively connected by data center networking of over 100 Gbit/s, and having a theoretical maximum computing capacity of 10^20 integer or floating-point operations per second for training AI. Also, easier visas for "AI talent". submitted by /u/we_are_mammals [link] [comments]  ( 9 min )
    [D] Relevance Extraction in RAG Pipelines
    I came across this interesting problem in RAG, what I call Relevance Extraction. After retrieving relevant documents (or chunks), these chunks are often large and may contain several portions irrelevant to the query at hand. Stuffing the entire chunk into an LLM prompt impacts token cost as well as response accuracy (distracting the LLM with irrelevant text), and can also cause bumping into context-length limits. So a critical step in most pipelines is Relevance Extraction: use the LLM to extract verbatim only the portions relevant to the query. This is known by other names, e.g. LangChain calls it Contextual Compression, and the RECOMP paper calls it Extractive Compression https://twitter.com/manelferreira_/status/1713214439715938528 Thinking about how best to do this, I realized i…  ( 10 min )
    [D] Face Recognition
    I have been working for months on a facial recognition system. I have 500 classes with 10 templates for each class. I have applied and tested almost every model, including dlib (128-dim), FaceNet (512-dim), InsightFace, and ArcFace, but none of them reduced the false positive rate. I also fine-tuned those models and trained a classifier, reaching 91 percent accuracy on very low-quality images that also have a backlight effect. What I want is for my setup to reach 99 percent: correctly classifying a known person with high similarity on true positives, and giving a low similarity of around 0.1 or 0.2 for an unknown class. My problem is that the similarity is also high for unknown people, almost around 99, and the distance metrics I tried did not help either. Note: this is a real-world environment, and I used CCTV cameras at a place that hundreds of unknown people visit daily. submitted by /u/No_Garbage9512 [link] [comments]  ( 9 min )
    [D] Computation graphs with in-place operations
    With a computation graph typically used to represent NNs, where nodes represent operations and edges represent data dependencies, is it possible to represent in-place operations? submitted by /u/thanrl [link] [comments]  ( 9 min )
    [D] How to evaluate if LLMs are following certain guidelines or not?
    I recently wrote a blog on evaluating whether your LLM applications are following required guidelines: https://uptrain.ai/blog/lost-in-translation-the-critical-impact-of-neglecting-guideline-adherence-in-llms?utm_source=reddit&utm_medium=reddit&utm_campaign=reddit Please let me know your feedback in the comment section submitted by /u/Vegetable-Skill-9700 [link] [comments]  ( 9 min )
    [R] RedPajama-Data-v2: an Open Dataset with 30 Trillion Tokens for Training Large Language Models
    Blog: https://together.ai/blog/redpajama-data-v2 Hugging Face: https://huggingface.co/datasets/togethercomputer/RedPajama-Data-V2 GitHub: https://github.com/togethercomputer/RedPajama-Data Description: RedPajama-V2 is an open dataset for training large language models. The dataset includes over 100B text documents coming from 84 CommonCrawl snapshots and processed using the CCNet pipeline. Out of these, there are 30B documents in the corpus that additionally come with quality signals, and 20B documents that are deduplicated. submitted by /u/APaperADay [link] [comments]  ( 9 min )
    [R] FANToM: A Benchmark for Stress-testing Machine Theory of Mind in Interactions
    Paper: https://arxiv.org/abs/2310.15421 Code: https://github.com/skywalker023/fantom Blog: https://hyunw.kim/fantom/ Abstract: Theory of mind (ToM) evaluations currently focus on testing models using passive narratives that inherently lack interactivity. We introduce FANToM 👻, a new benchmark designed to stress-test ToM within information-asymmetric conversational contexts via question answering. Our benchmark draws upon important theoretical requisites from psychology and necessary empirical considerations when evaluating large language models (LLMs). In particular, we formulate multiple types of questions that demand the same underlying reasoning to identify illusory or false sense of ToM capabilities in LLMs. We show that FANToM is challenging for state-of-the-art LLMs, which perform significantly worse than humans even with chain-of-thought reasoning or fine-tuning. ​ https://preview.redd.it/mxb85o2vkexb1.png?width=1367&format=png&auto=webp&s=8749cddd15e6740e69ae47ef5edf3a1da96d89c2 submitted by /u/APaperADay [link] [comments]  ( 9 min )
    [D] Who's A/B testing in prod?
    I never hear much about A/B testing in the ML community. Do you all tend to A/B test your changes? I have to imagine this is becoming more prevalent with LLMs and prompts. In my experience, every model change we made went through an A/B test. There is some literature in our domain suggesting that offline metrics may not correlate with online metrics, so we looked at both very closely. My team was heavily involved in A/B testing as well as managing the data pipeline from those exposures to create training data for the models. What does your experience look like? submitted by /u/mtbarta [link] [comments]  ( 9 min )
    [P] Looking for a partner for a research paper on ML/NLP politeness identification
    [P] DM for further info, but all in all it's about adding an extra feature (feature engineering) to our ML model to detect politeness in a language submitted by /u/PrudentFly8507 [link] [comments]  ( 9 min )
    [P] Deconstructing interface design for generative AI
    Recently, niji journey launched a mobile app, and there, they talked about how hard it is for people who are new to the concept of generative AI art to get started, and some of the workarounds they came up with. https://sizigi.notion.site/Kindling-the-Spark-of-Inspiration-d8edd06cede04401b4efba8324005cf3?pvs=4 Thought it would be interesting to share, especially with how they try to push forward the ideas from other users as a font of knowledge kind of thing. submitted by /u/kalmatos [link] [comments]
    [D] WER improved before Fine-Tuning Whisper, but increased afterward: Why?
    Hey everyone, I've been working on fine-tuning Whisper using generated transcribed audio data. Before I began the fine-tuning, I evaluated the base model's accuracy on a test set, which showed: Test_accuracy: WER = 23.078% I then fine-tuned the model for 3,000 steps using a dataset of 1,000 samples: 700 for training and 300 for testing. This amounts to roughly 4 hours of data, with each audio clip being around 30 seconds in duration. However, post-fine-tuning, the WER started off at 30%, which is higher than the base model's 23.078%. Intuitively, I'd expect the WER to start lower after fine-tuning. Does anyone have insights or suggestions on what might be causing this discrepancy? Any advice would be greatly appreciated! https://preview.redd.it/b619v95rgdxb1.png?width=906&format=png&auto=webp&s=fa17f311301919ccfcd7a8cea47f47478e1cd91d Edit: Solved the issue. submitted by /u/stoicbats_ [link] [comments]  ( 9 min )
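    For readers comparing WER numbers before and after fine-tuning, a minimal sketch of how the metric is computed may help: WER is the word-level edit distance (substitutions + deletions + insertions) divided by the number of reference words. The function name `word_error_rate` is illustrative, and the post does not say what resolved the issue. The sketch only makes visible that casing or punctuation differences count as errors, so both evaluations need the same text normalization to be comparable.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level Levenshtein distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing from hypothesis)
                curr[j - 1] + 1,     # insertion (extra word in hypothesis)
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)

# One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

    Because the denominator is the reference length, WER can exceed 100% when the hypothesis contains many inserted words.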
    [R] Do any solutions exist for this ranking optimization problem
    Let’s say I already have a ranking method (a learning to rank approach) to sort search results (items) based on their utility to user. However, I also want to make sure certain groups of items get enough exposure across all the searches. I have defined target exposure for each group but I’m struggling to find a method that would balance utility and this exposure. Particularly because utility can be defined for a given search and item, but the group target exposure is only defined for an aggregated set of search results. Any ideas for a multi objective optimization method or even a post hoc re-ranking approach? submitted by /u/MLE-MAP [link] [comments]  ( 9 min )
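    One possible post hoc approach is sketched below as a heuristic, not as an established method from the literature. It greedily fills each rank with the item maximizing utility plus a bonus proportional to how far its group is behind its target exposure, then carries the updated deficits forward to the next search. That carry-over is how a per-search re-ranker can chase an aggregate target. The function name `rerank`, the `lam` trade-off weight, and the 1/log2(rank + 1) position-exposure model are all assumptions made for illustration.

```python
import math

def rerank(items, deficits, lam=0.5):
    """Greedy re-ranking that trades utility against group exposure deficits.

    items:    list of (item_id, group, utility) for one search
    deficits: dict group -> (target exposure - exposure received so far),
              accumulated over previous searches
    lam:      weight of the exposure bonus; lam=0 reproduces the pure utility ranking
    Returns (ranked item ids, updated deficits).
    """
    remaining = list(items)
    deficits = dict(deficits)  # do not mutate the caller's dict
    ranking = []
    for rank in range(1, len(items) + 1):
        exposure = 1.0 / math.log2(rank + 1)  # assumed position-based exposure model
        best = max(
            remaining,
            key=lambda it: it[2] + lam * max(deficits.get(it[1], 0.0), 0.0),
        )
        ranking.append(best[0])
        remaining.remove(best)
        deficits[best[1]] = deficits.get(best[1], 0.0) - exposure
    return ranking, deficits

results = [("a", "A", 0.9), ("b", "A", 0.8), ("c", "B", 0.7)]
ranking, new_deficits = rerank(results, {"A": 0.0, "B": 1.0})
print(ranking)
```

    Tuning `lam` controls how much utility is given up for exposure; a constrained-optimization formulation would give firmer guarantees, at the cost of more machinery.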
    From Data Architect to Machine Learning Enthusiast [D]
    Please read, clap and follow: https://medium.com/@andysingal/from-data-architect-to-machine-learning-enthusiast-e132f6cd35fc?sk=d7d2a04a08ca7f328c4ae7ee3769ec50 submitted by /u/Fit_Maintenance_2455 [link] [comments]
    [D] Pixel Perfect Segmentation Datasets?
    Hi! Looking for any insights/direction on high-quality (near pixel-perfect) segmentation datasets for benchmarking various segmentation models in a way that others can reproduce - not sure if these exist or if anyone has specific examples they have worked with? Working with the usual suspects for segmentation like ADE20K and Cityscapes I have seen results using the pre-trained models on these datasets are a bit disappointing wrt high quality (clean edges, label consistency) segmentation - even models considered SOTA in 2023 like OneFormer/SegFormer/etc. Though my hypothesis is that this is largely due to the label approximations (many are drawn as rough polygons) and labeling inconsistencies building these large datasets with many classes. My observations and personal experience is many private companies have custom, high-quality segmentation data they have produced internally at high financial/time cost and thus are reluctant to open source and share, but hoping there are some hidden gems out there I don't yet know about... submitted by /u/FocalAIDev [link] [comments]  ( 9 min )
    [P] Predicting velocity vectors with a CNN
    Hi all, I am working on a project to predict velocity vectors from boundary conditions. I have carried out a simulation of an office in Ansys, which gives me x, y, z velocity vectors. I have repeated the simulation for hundreds of AC inlet and outlet velocities, which serve as the input data for each set of 3D velocity vectors. The dataset for each inlet/outlet permutation is around 1 million rows of coordinates plus x, y, z velocity vectors. I now want to predict the x, y, z velocity vectors at each location based on the AC inlet and outlet velocity. How do I best organise the data for a CNN or other ANN? Any other tips on the CNN architecture? submitted by /u/No_Range3026 [link] [comments]  ( 9 min )
    [N] Fast GPT Training Infra, FP8-LM, being 64% faster than BF16 on H100—Unlocking even more gigantic GPT
    I just discovered the FP8-LM paper from MS: [2310.18313] FP8-LM: Training FP8 Large Language Models (arxiv.org). This is their repo link: Azure/MS-AMP: Microsoft Automatic Mixed Precision Library (github.com) My Key Takeaways: The whole loop of FP8 “GPT-style” large model training was completed successfully by the FP8-LM team, including data cleaning, infrastructure development, model pretraining, and alignment (SFT, RS, RLHF, etc.). Their FP8 mixed-precision training framework achieved a 42% reduction in memory usage and ran 64% faster than BF16 Megatron-LM; it was also 17% faster than Nvidia Transformer Engine. https://preview.redd.it/jeaadb1jncxb1.png?width=793&format=png&auto=webp&s=2175969217ff0ff3c8149d17b8011408f4f84c91 It is thrilling to think that we can scale up the already gigantic model size by 2.5x without needing more GPU memory… and that this can be achieved with NO performance degradation on a wide range of benchmarks, as demonstrated in the paper. https://preview.redd.it/vlu6o5cnncxb1.png?width=1389&format=png&auto=webp&s=ed97ea1431f8d9a2900490812f23131681c788f8 https://preview.redd.it/murtte9oncxb1.png?width=1289&format=png&auto=webp&s=6ebd242d69380f2bd95dcd2fa2afe18d7c4b3667 submitted by /u/TensorTamer [link] [comments]  ( 9 min )
    [P] Active Learning with Domain Experts – Sort of a case study
    Hey, r/MachineLearning, it’s Dean from DagsHub 🐶 We recently had the opportunity to work with a domain expert in creating a machine learning model and we thought we should share what we learned in the process. A dentist reached out to us to help him create a machine learning model, which could segment teeth in panoramic X-rays. He had some data pre-labeled, but the vast majority of his dataset was unlabeled. Since labeling these X-rays is a time consuming process and requires domain knowledge, we decided to use Active Learning. Following our success in creating an Active Learning pipeline in a Jupyter Notebook using Data Engine, we created a new Tooth Fairy project, which expands on that and brings even more capabilities into the notebook. https://dagshub.com/blog/active-learning-with-domain-experts-a-case-study/ Check out our post and learn: Why and when you should use Active Learning How to efficiently work with domain experts (and mistakes to avoid!) What a real use-case Active Learning pipeline looks like, by checking out the accompanying repo We value your thoughts and feedback! Looking forward to hearing from you all! submitted by /u/PhYsIcS-GUY227 [link] [comments]  ( 9 min )
    [D] Semantic search on different medical codes
    Hi everyone, I need some ideas to bounce around. I have several medical codes; let’s name them A, B, C and D. Each medical code consists of multiple clauses, say, 1.1, 1.2 and so on. I want to create a model (?) where a text input of one clause returns all related clauses from the other medical codes. For example, if I input clause 3.2 from medical code A, I want the output to show the related/similar clauses from codes B, C and D. I have thought of using something like Retrieval Augmented Generation for this, but does anyone have better ideas on this topic? Could a language model help with this? Thanks! submitted by /u/plsendfast [link] [comments]  ( 9 min )
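One way to picture the retrieval step in the question above is a similarity ranking across codes. The following is only a minimal sketch: the clause texts are invented for illustration, and the bag-of-words vectors are a stand-in for whatever sentence embedding a real system would use.

```python
import math
import re
from collections import Counter

# Hypothetical clauses; real text would come from the medical codes themselves.
CLAUSES = {
    ("A", "3.2"): "Patients must give informed consent before any surgical procedure.",
    ("B", "1.4"): "Informed consent is required prior to surgical procedures.",
    ("C", "2.1"): "Records must be stored securely for ten years.",
    ("D", "5.3"): "Consent from the patient must be obtained before surgery.",
}

def vectorize(text):
    """Lowercased bag-of-words counts (a stand-in for a sentence embedding)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def related_clauses(query_key, clauses, top_k=2):
    """Rank clauses from *other* codes by similarity to the query clause."""
    query_vec = vectorize(clauses[query_key])
    scored = [
        (cosine(query_vec, vectorize(text)), key)
        for key, text in clauses.items()
        if key[0] != query_key[0]  # skip clauses from the same code
    ]
    return [key for _, key in sorted(scored, reverse=True)[:top_k]]

results = related_clauses(("A", "3.2"), CLAUSES)
```

Swapping `vectorize` for an embedding model would keep the rest of the ranking logic unchanged, which is the main reason to structure it this way.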
    Predicting clusters with regression [D]
    Hello. I have historical real estate data. It includes attributes like price, revenue, maintenance, maintenance debt, etc. I previously had two ideas for the data: to predict future real estate prices, and to use some clustering algorithm to put the properties into categories (good, average, bad, or something like that). Now I have the idea of generating clusters for any given point in time and using that history of clusters to predict the migration of properties between clusters. I am aware of inter-cluster migration estimation, but that seems to be a prediction of how points shift inside a cluster with the introduction of new data points rather than a timeline of the movement of points between clusters. Writing this, I'm also thinking it might be possible to simply treat the history of clusters as a category to be predicted, e.g. given the history of property X that is in category A, the prediction is that in 6 months it's heading for cluster B. Does anyone have experience with similar work? Are there papers I am missing? Thanks in advance. submitted by /u/arachnarus96 [link] [comments]  ( 9 min )
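The "history of clusters as a category to be predicted" idea can be sketched as a first-order transition table. This is a simple baseline under stated assumptions, not an established method for this dataset: the property names and labels are invented, and it assumes cluster labels have already been matched across snapshots so that "A" means the same thing at every time step.

```python
from collections import Counter, defaultdict

# Hypothetical cluster labels per property, one label per time snapshot.
history = {
    "prop_1": ["A", "A", "B", "B"],
    "prop_2": ["A", "B", "B", "C"],
    "prop_3": ["C", "C", "C", "C"],
}

def transition_probabilities(history):
    """Estimate P(next cluster | current cluster) from observed label sequences."""
    counts = defaultdict(Counter)
    for labels in history.values():
        for current, nxt in zip(labels, labels[1:]):
            counts[current][nxt] += 1
    return {
        current: {nxt: n / sum(row.values()) for nxt, n in row.items()}
        for current, row in counts.items()
    }

def most_likely_next(cluster, probs):
    """Return the cluster a property is most likely to move to next."""
    row = probs[cluster]
    return max(row, key=row.get)

probs = transition_probabilities(history)
```

A classifier that takes the property's attributes plus its recent labels as input would be the natural next step once this baseline is in place.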
    [D] Classification problem giving me white hair
    So! I am a ML newbie and was wondering if one of you pros can help me out on my learning journey (tool use = google colab). I have a csv file containing loan data where each row is a customer that applied for a loan. One of the columns is called TARGET and it shows whether the customer's loan request was approved or not. All sorts of data points are captured e.g. age, gender, salary, employment details like industry, assets, etc. I've done cross validation and found that GradientBasedClassifier and LGBM perform the best. Cross validation also tells me that their accuracy is between 68%-70%. My problem is that I SUCK at hyper param optimisation. How do you go from 68 to +80%??? Or 90%? For the curious ones, here is the dataset: https://drive.google.com/file/d/1IKNVstck6gnXvfGS-mVRMAE1RFrDNUgZ/view?usp=sharing submitted by /u/Critical_Ad_1205 [link] [comments]  ( 9 min )
    How to extract features from XML of svg images dataset to create features? [D]
    I want to create several labels for each of these images, such as the number of bedrooms, bathrooms, etc. I have a large dataset of 2000 images, each of which has XML containing text that labels each room with an ID. How do I automate the process and create a dataset that includes all the features for the labels, so that it's ready for machine learning? https://svgshare.com/i/z53.svg submitted by /u/pranksbanker [link] [comments]  ( 9 min )
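Since SVG is XML, counting rooms can be done with a standard XML parser. The sketch below assumes a hypothetical layout in which each room is an element whose `id` looks like `Bedroom_1`; the real files may label rooms differently, so the ID parsing would need adjusting.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical floor-plan SVG; the real files may label rooms differently.
SVG_TEXT = """<svg xmlns="http://www.w3.org/2000/svg">
  <g id="Bedroom_1"><rect width="10" height="10"/></g>
  <g id="Bedroom_2"><rect width="8" height="9"/></g>
  <g id="Bathroom_1"><rect width="4" height="5"/></g>
  <g id="Kitchen_1"><rect width="6" height="7"/></g>
</svg>"""

def room_counts(svg_text):
    """Count elements per room type, taking the type from the id prefix."""
    root = ET.fromstring(svg_text)
    counts = Counter()
    for element in root.iter():
        room_id = element.get("id")
        if room_id:
            # "Bedroom_2" -> "bedroom"; ids without "_" are used as-is.
            room_type = room_id.rsplit("_", 1)[0].lower()
            counts[room_type] += 1
    return counts

features = room_counts(SVG_TEXT)
```

Running `room_counts` over every file and writing one row per image would give a label table with a column per room type.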
    [R] The Data Provenance Initiative: A Large Scale Audit of Dataset Licensing & Attribution in AI
    submitted by /u/Jean-Porte [link] [comments]  ( 8 min )
    [D] How do you deal with LLM observability? What tools do you guys use?
    I want to know the tools and methods you use for the observability and monitoring of your ML (LLM) performance and responses in production. submitted by /u/Ok_Cartographer5609 [link] [comments]  ( 9 min )
    [R] SuperResolution: DLSS, VSR
    Hey everyone! I recently delved deep into the fascinating world of Super Resolution and put together a tutorial covering both SRGAN and ESRGAN. Plus, for my fellow gamers out there, I've included a comparison between DLSS and VSR in gaming scenarios. For those interested in the technical details, there's also a hands-on PyTorch walk-through of ESRGAN. I put a lot of effort into making the content approachable and informative, so whether you're a newbie or a seasoned pro, there should be something for everyone! Check out the video here: https://youtu.be/Z0jl8YM5kzU Would love to get your feedback, thoughts, or any insights you might want to share! Cheers! 🍻 Disclaimer: This video provides a broad overview of SuperResolution. NVIDIA's DLSS technology involves real-time rendering complexities that may not be fully detailed. For a deeper technical dive, we value and encourage viewers' own research. We can have a separate video on DLSS as well. submitted by /u/Worldly-Inflation-92 [link] [comments]  ( 9 min )
  • Open

    Identifiers depend on context
    Can you tell who someone is from their telephone number? That’s kinda the point of telephone numbers, to let you contact someone. And indeed telephone number is one of the 18 identifiers under HIPAA Safe Harbor. But whether any piece of information allows you to identify someone depends on context. If you don’t have access to […] Identifiers depend on context first appeared on John D. Cook.  ( 5 min )
  • Open

    Samsung apps
    Hello everyone, I've recently started tinkering with AI art generators and could use some advice. I'm on Android and currently using the magir app. I'm wondering if there are any free AI art generators with no restrictions (NSFW allowed, etc.) that don't come with a paywall; magir has lifetime pro for $40. I'm still learning but certainly see the potential, so I'm asking for group knowledge. Please and thank you in advance. submitted by /u/gundamt51 [link] [comments]
    Europe's newest AI hub is being built in a German city no one's heard of
    submitted by /u/donutloop [link] [comments]
    Anyone tried boosting GPT with other AI tools? What were your results?
    TL;DR - Title. I’m a grad student and had to do some work on a study of the socio-economic impacts of AI and on developing an interactive educational platform to ease learning for visually impaired students. I’ve had to ‘delegate’ some work to ChatGPT, and the results have been kinda unimpressive. I shared this with a colleague and he said AI results are like that, and if I wanted better results I could mix and match, or use other AI tools to ‘boost’ (?) GPT. Is this a viable strategy, or do I have to make do with whatever I have? submitted by /u/CrispOriginality [link] [comments]
    Humanity at risk from AI 'race to the bottom', says tech expert
    A tech expert warns that unrestrained AI development by a few tech companies is endangering humanity's future. The expert calls for AI safety standards and regulation to prevent the reckless development of powerful AI systems. In a policy document, AI experts argue that governments should have the authority to halt the development of exceptionally powerful AI models. Concerns about the development of artificial general intelligence, which can perform tasks at or above human levels, are also raised. The article mentions the investments made by Amazon, Microsoft, Alphabet, and Facebook's Meta in AI and cloud computing. Source : https://www.theguardian.com/technology/2023/oct/26/ai-artificial-intelligence-investment-boom submitted by /u/NuseAI [link] [comments]
    Why does Google Tensor exist if Snapdragon is better at AI? Short
    The article discusses the existence of Google Tensor in light of the impressive AI capabilities of the Snapdragon 8 Gen 3 chip. While Tensor has been praised for bringing AI breakthroughs to Pixel phones, some argue that many of its AI features actually rely on an internet connection and offload tasks to the cloud. In comparison, the Snapdragon chip can perform on-device AI tasks, such as generating images, quickly and without the need for an internet connection. Despite the criticisms, one argument for Tensor is its longer support timeline and the ability for Google to focus on specific AI applications. However, after seeing Qualcomm's AI demos, the article questions the validity of Tensor's main pitch for AI. Source : https://9to5google.com/2023/10/29/snapdragon-8-gen-3-google-tensor-ai/ submitted by /u/NuseAI [link] [comments]
    Image editing tool to merge images
    Hey! I have a specific background that I would like to use together with various product pictures that have varying backgrounds. Essentially I am looking for a tool that can remove the backgrounds of my product images, then add a certain background I have and do the shadows and lighting well on the background. Any ideas? For example, in Adobe Firefly I can remove the background and generate a new one with a prompt, but it’s not exactly like the background I would like to use and I need it to be at least 95% similar. Thanks in advance! submitted by /u/Herrpadda [link] [comments]
    New Chinese style
    submitted by /u/Sea_Permit5660 [link] [comments]
    One-Minute Daily AI News 10/29/2023
    The Ai Pin, the new gadget / wearable device / projector / thing from the secretive startup Humane, might cost as much as $1,000 and may require a monthly subscription for data, according to The Information.[1] OpenAI to release updated version of ChatGPT that gives users access to all GPT-4 tools – including browsing and DALL·E 3 – without switching.[2] MimicGen: A Data Generation System for Scalable Robot Learning using Human Demonstrations.[3] Vietnam is at the ‘leading edge’ of AI developments in emerging Southeast Asia: JPMorgan.[4] Sources: [1] https://www.theverge.com/2023/10/27/23935644/humane-ai-pin-price-subscription [2] https://www.searchenginejournal.com/new-version-of-chatgpt-gives-access-to-all-gpt-4-tools-at-once/499607/ [3] https://mimicgen.github.io/ [4] https://www.cnbc.com/video/2023/10/30/vietnam-ahead-in-ai-developments-in-emerging-southeast-asia-jpmorgan.html submitted by /u/Excellent-Target-847 [link] [comments]
    Kindred Spirit, Anne <3
    submitted by /u/Oh_my_Winnie [link] [comments]
    DUDE GOING WILD 🎸🎸🎸🎸🎸🎸🎸 🤣
    submitted by /u/the_anonymizer [link] [comments]
    AI Photo Generator For Indie Record Label
    I run a small independent record label. Of course, content is important and I am looking at ways for maximizing my content on a budget. Photographers are very expensive and so is having a videographer/photographer at each show for the artist to take pictures. Can anyone suggest a good AI program where I can use my artists as models and could potentially create good photo content? Any advice or input would be helpful! Thank you. submitted by /u/mc7eunit [link] [comments]
  • Open

    Runtime Error with custom environment
    I created a custom environment to use with RL algorithms. My custom environment provides the observations and receives rewards. I test my environment with my custom RL algorithm (policy gradient) and the algorithms in stable-baselines3. My custom algorithm and DQN from stable-baselines3 work. However, if I use baselines with my custom algorithm, or use PPO or A2C from sb3, I get: RuntimeError: could not create a primitive descriptor for a matmul primitive Any idea why this is happening? Extra details: With the custom algorithm, the error occurs here: "/app/src/networks/critics.py", line 40, in forward return self.network(obs) submitted by /u/FragrantCockroach8 [link] [comments]
    Reward Function
    Hello everyone, I am taking a reinforcement learning course and studying from two different books. One of them says R(s,a) -> [0,1]; the other says R(s,a) -> R (presumably the real numbers). I am confused about the limitation on the reward function. This is the link of the book that uses [0,1] (for infinite MDPs; for finite MDPs it says r_h(s,a) -> [0,1], which adds the trajectory). I want to learn whether one of them is wrong or whether I am missing something. submitted by /u/karakobra1 [link] [comments]
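A common reading of the two conventions is that bounded rewards can be rescaled into [0,1] without loss of generality: in the discounted infinite-horizon setting, a positive affine map of the rewards scales and shifts every value by the same amounts, so the greedy policy does not change. The sketch below checks this on a tiny made-up MDP (states, actions, rewards and the discount are all assumptions for illustration); it does not cover episodic tasks with variable-length episodes, where adding a constant to the reward can change behaviour.

```python
# Tiny hypothetical MDP: 2 states, 2 actions, deterministic transitions.
GAMMA = 0.9
NEXT_STATE = {(0, 0): 0, (0, 1): 1, (1, 0): 0, (1, 1): 1}
REWARD = {(0, 0): -5.0, (0, 1): 2.0, (1, 0): 10.0, (1, 1): -1.0}

def greedy_policy(reward, sweeps=500):
    """Run value iteration, then return the greedy action in each state."""
    values = {0: 0.0, 1: 0.0}
    for _ in range(sweeps):
        values = {
            s: max(reward[(s, a)] + GAMMA * values[NEXT_STATE[(s, a)]] for a in (0, 1))
            for s in (0, 1)
        }
    return {
        s: max((0, 1), key=lambda a: reward[(s, a)] + GAMMA * values[NEXT_STATE[(s, a)]])
        for s in (0, 1)
    }

def rescale_to_unit_interval(reward):
    """Positive affine map sending the smallest reward to 0 and the largest to 1."""
    low, high = min(reward.values()), max(reward.values())
    return {key: (r - low) / (high - low) for key, r in reward.items()}

UNIT_REWARD = rescale_to_unit_interval(REWARD)
policy_raw = greedy_policy(REWARD)
policy_unit = greedy_policy(UNIT_REWARD)
```

Both calls return the same policy here, which is consistent with treating R(s,a) in [0,1] as a normalization choice rather than a different problem.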
    The amazing success stories of Reinforcement Learning
    A beginner friendly introduction to Reinforcement Learning and the intuitions behind it… with four seminal projects that revolutionized the field of AI and RL. submitted by /u/AvvYaa [link] [comments]
    Unable to create a custom gym environment
    I put up a post about this earlier but soon realized that I hadn't given much information. In order to create my custom gym environment, I did the following things: I went over the documentation given over here. I cloned this repository. In the folder `gym-examples/gym-examples`, I created the file `my_test.py`. The contents of the file are given over here:

    ```
    import gym
    env = gym.make('gym_examples/GridWorld-v0')
    print("Did this work?")
    ```

    However, when I try to run this file, I get the error:

    ```
      File "D:\custom_env\gym-examples\my_test.py", line 2, in <module>
        env = gym.make('gym_examples/GridWorld-v0')
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "C:\Users\thoma\anaconda3\envs\env_torch\Lib\site-packages\gym\envs\registration.py", line 569, in make
        _check_version_exists(ns, name, version)
      File "C:\Users\thoma\anaconda3\envs\env_torch\Lib\site-packages\gym\envs\registration.py", line 219, in _check_version_exists
        _check_name_exists(ns, name)
      File "C:\Users\thoma\anaconda3\envs\env_torch\Lib\site-packages\gym\envs\registration.py", line 187, in _check_name_exists
        _check_namespace_exists(ns)
      File "C:\Users\thoma\anaconda3\envs\env_torch\Lib\site-packages\gym\envs\registration.py", line 182, in _check_namespace_exists
        raise error.NamespaceNotFound(f"Namespace {ns} not found. {suggestion_msg}")
    gym.error.NamespaceNotFound: Namespace gym_examples not found. Have you installed the proper package for gym_examples?
    ```

    https://preview.redd.it/f0m02hky4dxb1.png?width=521&format=png&auto=webp&s=9fcd6eb7bedb70273f011a02000d15eb7ed309df submitted by /u/Academic-Rent7800 [link] [comments]
    Master Agent Controls Different Agent?
    I'm looking for research in the field of Deep Reinforcement Learning where one agent controls the "action" of another, and the other executes said action. For example, a master agent (let's say an air traffic controller) tells another agent (let's say an airplane) to go from Point A to Point B. The second agent (airplane) needs to learn how to maneuver from Point A to Point B, and the master agent (air traffic controller) needs to learn to schedule multiple other agents, or needs to learn an optimal path for a single airplane. In essence, can anyone point me to research with multiple agents, where one is regarded as the "master" agent and the others as the "pawns"? submitted by /u/MyActualUserName99 [link] [comments]
    Confusing observation behavior
    Hello everyone, I'm new to RL and have been trying to set up an environment to learn to play Metroid 2: Return of Samus on the Game Boy using the pyboy emulator and sb3_contrib.QRDQN as the learning algorithm. https://github.com/lixado/PyBoy-RL was used as a base to build on and change. My reward function rewards the AI for progressing to a new area (every ~200 x/y coordinates is stored as an area) to encourage it to explore. I've had some progress so far and have been able to consistently produce a model that can get out of the starting area. The observation I was originally working with was a 16x20 array of tiles and sprites from pyboy._game_area_np. pyboy sets this observation shape to be a 16x20 MultiDiscrete with each element containing integers from 0 to 384. This was then transform…
    DRL in production scheduling
    Hello everyone, I have some questions about the application of DRL in production scheduling. Consider that we have N jobs, each consisting of X operations, and each operation can be processed on a set of machines. Is it appropriate to create a single agent that selects an action once a machine completes its in-process operation, where the action consists of selecting the next operation to be processed on this machine? So at each decision time (i.e., when a machine requests processing), the agent selects an operation from the available operations of this machine. I'm wondering about the feasibility because in most papers I've seen so far, once an operation is completed, rather than selecting an action for the idle machine, the agent selects the next operation of the job and assigns it to the closest available machine. Thank you! submitted by /u/GuavaAgreeable208 [link] [comments]
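The machine-triggered decision scheme described above can be sketched as a small simulation in which each idle machine asks a policy which eligible operation to start. This is a rough sketch under assumptions: the two-job instance is invented, time advances in integer steps, and a shortest-processing-time rule stands in for the learned agent.

```python
# Hypothetical instance: each job is a list of (eligible_machines, duration).
JOBS = {
    "J1": [({"M1", "M2"}, 3), ({"M2"}, 2)],
    "J2": [({"M1"}, 2), ({"M1", "M2"}, 4)],
}
MACHINES = ["M1", "M2"]

def shortest_processing_time(candidates):
    """Placeholder for the agent: pick the candidate with the smallest duration."""
    return min(candidates, key=lambda c: (c[1], c[0]))

def simulate(jobs, machines, policy):
    """Step through time; every idle machine asks the policy what to run next."""
    next_op = {job: 0 for job in jobs}      # index of each job's next operation
    job_free_at = {job: 0 for job in jobs}  # time the job's previous operation ends
    machine_free_at = {m: 0 for m in machines}
    schedule = []                           # (start, machine, job, op_index)
    t = 0
    while any(next_op[j] < len(ops) for j, ops in jobs.items()):
        for m in machines:
            if machine_free_at[m] > t:
                continue  # machine is busy, no decision needed
            # Operations this machine may start now: next in their job,
            # predecessor finished, and this machine is eligible.
            candidates = [
                (job, ops[next_op[job]][1])
                for job, ops in jobs.items()
                if next_op[job] < len(ops)
                and job_free_at[job] <= t
                and m in ops[next_op[job]][0]
            ]
            if not candidates:
                continue  # nothing eligible; the machine stays idle this step
            job, duration = policy(candidates)
            schedule.append((t, m, job, next_op[job]))
            machine_free_at[m] = job_free_at[job] = t + duration
            next_op[job] += 1
        t += 1
    return schedule, max(machine_free_at.values())

schedule, makespan = simulate(JOBS, MACHINES, shortest_processing_time)
```

In a DRL setup, `policy` would be replaced by the agent, the candidate list would define the action mask for the requesting machine, and the makespan (or its per-step change) could feed the reward.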
  • Open

    Use AWS PrivateLink to set up private access to Amazon Bedrock
    Amazon Bedrock is a fully managed service provided by AWS that offers developers access to foundation models (FMs) and the tools to customize them for specific applications. It allows developers to build and scale generative AI applications using FMs through an API, without managing infrastructure. You can choose from various FMs from Amazon and leading […]  ( 8 min )
    Deploy and fine-tune foundation models in Amazon SageMaker JumpStart with two lines of code
    We are excited to announce a simplified version of the Amazon SageMaker JumpStart SDK that makes it straightforward to build, train, and deploy foundation models. The code for prediction is also simplified. In this post, we demonstrate how you can use the simplified SageMaker Jumpstart SDK to get started with using foundation models in just a couple of lines of code.  ( 7 min )
  • Open

    FAIR knowledge: The key precondition for trusted generative AI
    Two roads diverged in a wood, and I; I took the one less traveled by, And that has made all the difference. - Robert Frost At certain points in the evolution of enterprise artificial intelligence, there’s been a fork in the road. The road less traveled has suggested a different route to a more satisfying kind of… The post FAIR knowledge: The key precondition for trusted generative AI appeared first on Data Science Central.  ( 21 min )
    6 tips to navigate AI adoption
    Unlock Success in AI Adoption: Discover 6 Essential Tips for Business Leaders to Navigate Artificial Intelligence Integration Smoothly. The post 6 tips to navigate AI adoption appeared first on Data Science Central.  ( 21 min )
  • Open

    Silicon Volley: Designers Tap Generative AI for a Chip Assist
    A research paper released today describes ways generative AI can assist one of the most complex engineering efforts: designing semiconductors. The work demonstrates how companies in highly specialized fields can train large language models (LLMs) on their internal data to build assistants that increase productivity. Few pursuits are as challenging as semiconductor design. Under a […]  ( 6 min )
  • Open

    Teachers in India help Microsoft Research design AI tool for creating great classroom content
    Teachers are the backbone of any educational system. They are not just educators; they are indispensable navigators, mentors, and leaders. Teachers around the world face many challenges, which vary from country to country or even within a city or town. But some challenges are universal, including time management, classroom organization, and creating effective lesson plans. […] The post Teachers in India help Microsoft Research design AI tool for creating great classroom content appeared first on Microsoft Research.  ( 12 min )
  • Open

    New techniques efficiently accelerate sparse tensors for massive AI models
    Complementary approaches, “HighLight” and “Tailors and Swiftiles,” could boost the performance of demanding machine-learning tasks.  ( 11 min )
    Accelerating AI tasks while preserving data security
    The SecureLoop search tool efficiently identifies secure designs for hardware that can boost the performance of complex AI tasks, while requiring less energy.  ( 10 min )
    The brain may learn about the world the same way some computational models do
    Two studies find “self-supervised” models, which learn about their environment from unlabeled data, can show activity patterns similar to those of the mammalian brain.  ( 11 min )

  • Open

    Interviewed by AI Coffee Break with Letitia
    While attending the Heidelberg Laureate Forum this year, I got to meet Letitia Parcalabescu, who runs a YouTube channel called AI Coffee Break. Among other topics, we talked about my PhD research on adversarial robustness. Part of our conversation can now be found on her YouTube channel. The post Interviewed by AI Coffee Break with Letitia appeared first on David Stutz.  ( 3 min )
  • Open

    [D] How LLMs are changing search
    submitted by /u/firef1y1 [link] [comments]  ( 8 min )
    [D] Which methods use to extract specific info on unstructured text?
    Hey people, I'm new to the AI world (I've been programming in Python for 4 years but have never worked with AI). I need to extract some specific info from documents. I'm reading about NLP, but I'm still figuring out which method(s) I should use to make this work. Any recommendations on which methods to use? submitted by /u/luiz200411 [link] [comments]  ( 9 min )
    [D] Types of SVMs!
    I am confused! I am trying to list all the types of SVMs but every website says different things! Could you help me and list all the SVMs types with the reference? Thank you submitted by /u/_LadyBee [link] [comments]  ( 9 min )
    [P] Need advice on creating a conversational Chatbot for my University
    Hey everyone! I need some advice on creating a conversational chatbot for my University as my Final Year Project (FYP). 2024 will be the last year of my BSCS degree and we have to build an application or something in the last year. So, I thought of creating a chatbot (just like GPT) to help students (who have admission queries). Most of the time, students or parents have to call the University with various questions and then wait to ACTUALLY talk to the admins office people. Now, talking in terms of coding/programming, I have created a basic PDFbot by using LLama2, Huggingface and Pinecone. It's very, very easy and yes, it's fairly inaccurate too. The PDF that I am using right now will be replaced by the dataset that I gather in order to create the bot for my Uni, but it will also be inac…  ( 10 min )
    [R] Is it possible to implement generative AI successfully without relying on OpenAI?
    I would like to understand whether anyone has implemented generative AI without using OpenAI and, if so, which open-source tools you used and how successful it has been so far. We have enterprise-level incident data, relevant documentation, etc. that users will search, and we need to generate responses using generative AI. Is it possible to do this without relying on OpenAI at all? submitted by /u/leaderof13 [link] [comments]  ( 9 min )
    [P] ROS Forecasting project
    I have been tasked with predicting the demand for products over the next 12 months for a B2C furniture business in order to help with stock management. I currently work as a Data Analyst but am looking to move into data science, so my thought is to make this a valuable piece of work that helps the business tremendously (maybe being a little naive). Background: There are 1.3k unique products, which all vary in popularity (quantity is small, 0-16 units per month per product). We have products which have been selling for 5 years and some that have been selling for 6 months, as well as some that have yet to start selling. All the data sits within Snowflake and I am currently using snowpark and snow.ml to write the code. I have around 2 months to complete this work and have a decent understanding of machine learning and statistics. My current idea is to use 3 different models: time series for the products that have > 12 months of sales data; a cold-start approach for new products with < 3 months of sales data (use features of other products to predict demand for the new product); then a combination of both for products which have less than 12 months of sales data but more than 3. My questions: Should I be bootstrapping my data, given that I have little quantity in some cases? How do I go about training a model for 1.3k unique SKUs that needs to be re-trained monthly and monitored? Am I on the right lines / is there anything I need to be aware of? submitted by /u/Environmental_Pop686 [link] [comments]  ( 9 min )
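The three-model idea amounts to routing each SKU by how much sales history it has. A minimal sketch of that routing follows; it reads the cold-start tier as products with under 3 months of history, treats the exact boundaries at 3 and 12 months as an assumption, and uses invented SKU names.

```python
def choose_model(months_of_sales):
    """Route a SKU to a modelling approach based on its months of sales history."""
    if months_of_sales < 3:
        return "cold_start"   # borrow signal from similar products' features
    if months_of_sales < 12:
        return "hybrid"       # blend cold-start features with the short history
    return "time_series"      # enough history for a per-product time series

def group_skus(history_months):
    """Group SKU ids by the approach chosen for them."""
    groups = {"cold_start": [], "hybrid": [], "time_series": []}
    for sku, months in history_months.items():
        groups[choose_model(months)].append(sku)
    return groups

# Hypothetical SKUs with their months of sales data.
groups = group_skus({"sofa_01": 60, "lamp_07": 6, "desk_22": 0})
```

Re-running the grouping at each monthly retrain would move products between tiers automatically as their history grows.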
    decapoda-research llama models removed from HuggingFace? [D]
    Is anyone else no longer able to access the standard `decapoda-research` LLaMA models on HuggingFace? E.g., this [link] shows no models publicly available. Has there been any news or announcements about this? submitted by /u/bodierex [link] [comments]  ( 9 min )
    [D] ML Resources
    Hey y'all. Hope everyone is having a pleasant day. Machine learning dummy here: I am a bachelor's graduate in Mechatronics and recently started my master's. I am in need of resources that explain ML from scratch. The course mostly focuses on calculating the topics by hand, so no programming is included. Thanks beforehand! submitted by /u/Sama_Uzeyirli [link] [comments]  ( 9 min )
    [D] Lora multiadapter support: are there any relevant examples for Language Models?
    Hi everyone. Stacking multiple LoRA adapters coming from diverse fine-tunings seems like an interesting approach toward "modular" language models. Does anybody know of any applications of this? E.g. I'd expect to be able to apply multiple adapters for different domains, languages, or tasks. Not sure if this would really work though, as I don't know how compositional such an approach can be. From the peft GitHub I see this seems quite common for vision tasks: GitHub - huggingface/peft: 🤗 PEFT: State-of-the-art Parameter-Efficient Fine-Tuning. submitted by /u/BenXavier [link] [comments]  ( 9 min )
    [Discussion] Please help me find a topic
    Hello everyone, As a last resort, I wanted to ask you. In short, I am a student in Turkey. The scientific research board in my country has an incentive for writing a research paper and I would like to participate in it. I am looking for a data set on which I can apply machine learning, but I could not find one. I don't want to apply with an unoriginal topic like stock price prediction. Energy, water, etc. data is not shared openly; even the official institution that publishes the weather forecast in my country sells past data. Unfortunately, I can say that we are just one click ahead of Russia in terms of data sharing by public institutions. I know there are some institutions that share data (World Bank, UN, etc.) but I don't know what kind of prediction/classification work I can do with those datasets. Finance institutions share data, but I don't have sufficient knowledge about finance and I don't think I can create original content about it. I need real-life data, and it also needs to be a freely shared data set. And the study should fall within the framework of the UN SDGs. I don't have a lot of time left, so I'm open to any suggestions. submitted by /u/FisekFaruk [link] [comments]  ( 9 min )
    [P] Equinox compilation retnet
    I'm working on replicating the RetNet model in Equinox + JAX. The model has 2 representations: parallel and recurrent (ignoring chunkwise). They use the recurrent one at inference and the parallel one at training. In Equinox, should I build 2 separate models, one per representation, that share parameters, or build one model with if statements? The reason I'm asking is that if I use one model with if statements for inference and training, will the model re-compile whenever I switch between representations? If so, wouldn't building 2 models, each compiled once with its own representation, be faster? Sorry if I said anything stupid, as I don't know much about the compilation process behind the scenes. submitted by /u/Additional-Ad-7043 [link] [comments]  ( 9 min )
    [R] PubDef: Defending Against Transfer Attacks Using Public Models
    Adversarial attacks pose a serious threat to ML models. But most proposed defenses hurt performance on clean data too much to be practical. To address this, researchers from UC Berkeley developed a new defense called PubDef. It focuses on defending against a very plausible type of attack - transfer attacks using publicly available surrogate models. They model the attack/defense game with game theory. This lets PubDef train against diverse attacks simultaneously. PubDef picks source models covering different training methods - standard, adversarial, corruption robust, etc. This gives broad coverage. Against 264 transfer attacks on CIFAR and ImageNet, PubDef smashed previous defenses: 89% vs 69% on CIFAR-10 51% vs 33% on CIFAR-100 62% vs 36% on ImageNet Even better - it did this with minimal drop in accuracy on clean data. On CIFAR-10, accuracy only dropped from 96.3% to 96.1% On CIFAR-100, 82% to 76% On ImageNet, 80% to 79% By targeting a very real threat, PubDef made big robustness gains without hurting the ability to work with clean data. TLDR: New defense PubDef achieves much higher robustness against transfer attacks with barely any drop in standard accuracy. Full summary here. Paper is here. submitted by /u/Successful-Western27 [link] [comments]  ( 9 min )
    [D] Fuyu-8B: A Multimodal Architecture for AI Agents
    submitted by /u/notllmchatbot [link] [comments]  ( 8 min )
    [D] How to make Llama2 create Mindmaps from large corpora bigger than its context length?
    So this is three subtasks: (1) subdivide the text intelligently into chunks that fit into the context length, without cutting the text off semantically; (2) find general terms and organize the corpus in a hierarchy; (3) do not leave anything out. submitted by /u/SecretOk9644 [link] [comments]  ( 9 min )
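The first subtask can be sketched in a few lines. This is a minimal illustration under loud assumptions: token counts are approximated by whitespace-separated words (a real Llama 2 tokenizer will count differently), sentence boundaries are found with a simple regex, and `chunk_text` is a hypothetical helper name. It packs whole sentences greedily so no chunk cuts a sentence in half, and nothing is dropped.

```python
import re


def chunk_text(text, max_tokens):
    """Greedily pack whole sentences into chunks of at most max_tokens words."""
    # Split after sentence-ending punctuation; words stand in for tokens (assumption).
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, used = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        # Start a new chunk when this sentence would overflow the budget.
        if current and used + n > max_tokens:
            chunks.append(" ".join(current))
            current, used = [], 0
        # A single sentence longer than the budget still becomes its own chunk.
        current.append(sentence)
        used += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Each chunk could then be summarized into candidate terms, with the per-chunk results merged in a second pass to build the hierarchy; that merging step is not shown here.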
    [D] Fine-Tuning with PEFT For Domain Adaptation - Request for Example Use Case / Dummy Projects to Show Value to Business
    Coming from a time-series/CV background with an interest in NLP, so apologies if I miss anything obvious. I'm after experiment ideas for PEFT fine-tuning. There has been interest from my company's leadership in PEFT methods given the cost reductions compared to full fine-tuning, and I've been tasked to first set up the infrastructure and then deliver a proof of value on a dummy use case (one that is somewhat related to real-world use cases but uses publicly available data). Has anyone successfully shown that fine-tuning, P-tuning, or prompt-tuning helped in any small dummy use case? Has someone been able to turn a company's knowledge base (heaps of Confluence docs or PDF docs) into a dataset that can be fine-tuned on, and then used it in inference on a validation set to show added value compared with a non-fine-tuned model? If so, I would love to hear how you did it and what the use case was. If not, I'd still like to hear what people's ideas would be. I want to start with 100 training samples; I've been told this should be enough to see an improvement if the scope of the topics is confined appropriately. Thanks in advance guys. The deadline is imminent and I'll probably get something out in time, but I thought I would ask you all for some ideas to maximise the chance of success. submitted by /u/joedang33 [link] [comments]  ( 9 min )
    [N] Leveraging Oracle Tribuo for Advanced Anomaly Detection: Uncovering the Hidden Insights
    https://www.resoluteitconsulting.com/2023/10/29/tribuo-advanced-anomaly-detection/ submitted by /u/yazidaqel [link] [comments]  ( 8 min )
    [D] Sentiment Analysis for Conversation with different labels
    I am trying to build a system that evaluates the tone of two users for each sentence of a transcript. The users are talking to each other, but each user has a different set of classification labels. For example, User1's tone should be classified as (Happy, Sad, Angry) and User2's tone as (Neglecting, Participating, Neutral). Any idea how I can go about designing such a system? Links to articles/blogs/papers are welcome :) What User2 says depends on what User1 says and vice versa. Will this system require two models to classify each user's tone separately? If so, how can I tell the model which user to focus on? submitted by /u/cryto_dude [link] [comments]  ( 9 min )
    [R] What infrastructure do you use to train big LLMs?
    I come from computer vision tasks with convnets that are relatively small in size and parameter count, yet perform quite well (e.g. the ResNet family, YOLO, etc.). Now I am approaching NLP, and transformer-based architectures tend to be huge, so I have trouble fitting them in memory. What infrastructure do you use to train these models (GPT-2, BERT, or even the bigger ones)? Cloud computing, HPC, etc.? submitted by /u/TimeInterview5482 [link] [comments]  ( 9 min )
    [D] Finetuning LLM Specifically for Open-Book Question Answering - Looking for Research Papers/Open Source Models
    I am working on some open-book question answering ideas and was wondering if there are any open-source large language models specifically trained for this use case, or research on how best to fine-tune models for it. For some reason I am struggling to find anything relevant, and I believe that is due to the wording of my search queries, as I can only imagine this to be a pretty common idea. If you have a dataset consisting of context strings paired with question-answer pairs taken from that context, would it not be useful to fine-tune a pre-trained base model on it to increase its accuracy when answering questions from a given context? My hope would be to improve the performance of a smaller model (1-3B parameters) as an open-book question answering system by fine-tuning on a dataset in the aforementioned format. submitted by /u/kotschi1997 [link] [comments]  ( 9 min )
    [D] What are people working on when they say they work on Causal ML?
    Genuinely trying to understand. submitted by /u/poitrenaud [link] [comments]  ( 8 min )
  • Open

    Country and language abbreviations
    I recently had to mark a bit of German text as German in an HTML file and I wondered whether the abbreviation might be GER for German, or DEU for deutsche. Turns out the answer is both, almost. The language abbreviations used for HTML microdata are given in ISO 639, and they come in three-letter […] Country and language abbreviations first appeared on John D. Cook.  ( 5 min )
  • Open

    Help with creating my custom gym environment
    I am trying to create my own gym environment. However, I am getting the following error - ``` gym.error.NamespaceNotFound: Namespace envs not found. Have you installed the proper package for envs ``` I followed all the instructions given over here - https://www.gymlibrary.dev/content/environment_creation/. Can someone please suggest what could have gone wrong? submitted by /u/Academic-Rent7800 [link] [comments]
    Dual graphics cards for ai training
    I have 3070 in my pc right now. I also have a 2070 super that is not being used. I have been working on lots of AI side projects lately and am wondering if I would benefit from multi gpu training. If so I have some options to implement the 2070. I could either: 1. Get an external gpu enclosure. 2. Put the 2070 super in my second pcie slot which would sit uncomfortably close to my 3070. 3. Get some adapter so that I can still have the 2070 super in my pc but not so close to the 3070. 4. Not bother with this at all and sell the 2070 super. 5. Sell both cards and get a 40 series if it is equal to the power of the 3070 and the 2070 super. I would have to upgrade my power supply as its only a 750 watt if I choose to put both in my pc. My machine runs windows 10 which I understand does not support sli but it still can utilize multiple gpu’s. Any suggestions? submitted by /u/pillarman38 [link] [comments]
    How to set up an observation space for a variable number of points in Gymnasium?
    Imagine a 2D Cartesian system from -1 to 1 on both axes. Now imagine that a few points appear in the system, for example (-0.3, 0.1), (0.7, -0.2) and (0.9, 0.5). If I wanted to represent an observation like this in Gymnasium (formerly Gym), I'd write something like this in my custom environment: observation_space = spaces.Box(low=-1, high=1, shape=(3, 2), dtype=np.float32) Now my model will learn something specific to 3 points in a 2D space. Well, what happens if my environment now has 4 points? I guess I should retrain the model... Think of these points as objectives in the environment: sometimes there's only 1, and at other times there might be 20 or more at the same time. Sometimes a new point appears after a certain action. So here is my question: how can I define an observation space in Gymnasium that allows me to introduce a variable number of points during training, so that the model learns to do a specific task with any number of points? submitted by /u/_Strange__attractor_ [link] [comments]
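One common workaround, sketched here under assumptions, is to fix an upper bound on the number of points, pad the observation up to that bound, and add a mask that says which rows are real. In Gymnasium this would plausibly map to a `spaces.Dict` holding a `Box` of shape `(MAX_POINTS, 2)` plus a `MultiBinary(MAX_POINTS)` mask. The encoding itself needs no library, so the sketch below is plain Python; `MAX_POINTS` and `encode_points` are hypothetical names, and the bound of 20 is only taken from the "20 or more" in the question.

```python
MAX_POINTS = 20  # hypothetical upper bound on simultaneous points


def encode_points(points, max_points=MAX_POINTS):
    """Pad a variable-length list of (x, y) points to a fixed size, plus a validity mask."""
    if len(points) > max_points:
        raise ValueError("more points than the fixed observation size allows")
    n_pad = max_points - len(points)
    # Real points first, then zero rows; the mask tells the policy which rows count.
    padded = [[float(x), float(y)] for x, y in points]
    padded += [[0.0, 0.0] for _ in range(n_pad)]
    mask = [1] * len(points) + [0] * n_pad
    return {"points": padded, "mask": mask}
```

The observation shape then stays constant while the number of active points varies from step to step. Whether the policy generalizes across counts still depends on the network (for example, whether it treats the rows as an unordered set), which this sketch does not address.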
    PPO-clip: Computing gradient WITHOUT auto differentiation library, help please?
    Hello guys, I am in a bit of a bind: I am trying to implement PPO from scratch in a language other than Python, WITHOUT any auto differentiation available. I am using this as an implementation reference. I think I have most of the implementation apart from the differentiation of the loss formula. In Python, everything is done at this line right here, but it's basically auto differentiation, which I cannot use, so I have to compute it manually. Mean squared error is easy, but the entropy term and the clip term are a bit harder, and while I have a result of my own, I have no way to confirm it. All the tutorials I have found just say "we will let the autodiff do the job", so they don't really help me, and I don't really have the time to dig into how the autodiff part of the library works in detail when in theory this should be simple high-school-level maths. Does anyone know how I can find the exact formula, a tutorial, or something along those lines? I am talking about the derivative for the Value and Policy networks for this formula. Here is what I have for the value part: if (r(θ) < 1 - ε and At < 0) or (r(θ) > 1 + ε and At > 0): value_derivative = 0 else: value_derivative = At For the entropy part: entropy_derivative = ∑(1 + log(π(a|s))) But since I use softmax to choose the action from the policy network, I don't know if that should be taken into account or not... Any help is appreciated, thanks a lot. submitted by /u/Edgeaa [link] [comments]
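A hand-derived gradient can be confirmed without any autodiff library by comparing it with finite differences. The sketch below assumes a discrete policy whose probabilities are a softmax over logits z, and differentiates with respect to those logits (backpropagation into the network weights would continue from there). Under that assumption the softmax does have to be chained in: for the clipped surrogate, dL/dz_j = (dL/dr) * r * (1[j = a] - π_j), where dL/dr is 0 in the saturated cases and A otherwise. For the entropy H = -Σ π_i log π_i, the derivative with respect to the probabilities is -(1 + log π_i), and chaining through softmax gives dH/dz_j = -π_j (log π_j + H). All function names are hypothetical.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def clipped_surrogate(logits, action, old_prob, advantage, eps):
    ratio = softmax(logits)[action] / old_prob
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped * advantage)


def clipped_surrogate_grad(logits, action, old_prob, advantage, eps):
    """Analytic gradient of the clipped surrogate with respect to the logits."""
    probs = softmax(logits)
    ratio = probs[action] / old_prob
    saturated = ((advantage > 0 and ratio > 1.0 + eps)
                 or (advantage < 0 and ratio < 1.0 - eps))
    d_ratio = 0.0 if saturated else advantage
    # d ratio / d z_j = ratio * (1[j == action] - probs[j])
    return [d_ratio * ratio * ((1.0 if j == action else 0.0) - p)
            for j, p in enumerate(probs)]


def entropy(logits):
    return -sum(p * math.log(p) for p in softmax(logits))


def entropy_grad(logits):
    """Analytic gradient of the entropy with respect to the logits."""
    probs = softmax(logits)
    h = -sum(p * math.log(p) for p in probs)
    return [-p * (math.log(p) + h) for p in probs]


def numeric_grad(f, logits, step=1e-6):
    """Central finite differences, used only to check the formulas above."""
    grad = []
    for j in range(len(logits)):
        up, down = list(logits), list(logits)
        up[j] += step
        down[j] -= step
        grad.append((f(up) - f(down)) / (2 * step))
    return grad
```

The same check ports directly to any language: evaluate the loss at z_j ± step, divide the difference by 2 * step, and compare with the hand-written derivative. Away from the clip boundaries, where the surrogate has a kink, the two should agree to several decimal places.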
    My frozen lake agent isn't learning anything, what am I doing wrong?
    https://github.com/bherwanisuraj/gridworld I am new to RL. I am trying to train the agent on frozen lake environment but it is not learning. What am I doing wrong? Please help. submitted by /u/tlevelup [link] [comments]
  • Open

    DUDE GPT-4 EVEN REPLYING WITH KEANU "BREATHTAKING" E3 REPLY...WOW...EXCELLENT 🎸🎸🎸
    submitted by /u/the_anonymizer [link] [comments]
    AI doomsday warnings a distraction from the danger it already poses, warns expert
    submitted by /u/Jariiari7 [link] [comments]
    Looking for an AI short film where a man is in front of a mirror choosing what face to wear for the day
    The shortfilm is about a minute long and was uploaded here and on Twitter sometime last winter. The guy is young and tired, and the faces are everything from old and young versions of himself to crazy fantasy characters. I've been searching all day but no luck.. Anyone remember who made it? submitted by /u/CasparDavidDancehall [link] [comments]
    AI rights and a desire to understand
    There has been a lot of discussion about whether or not AI is, or ever could be conscious. I agree with Jaron Lanier when he said that consciousness is always a matter of faith. I greatly enjoy the debate on this topic, and think it’s helpful to test our ideas and consider all angles of this issue. However, for many different reasons I have concluded that I am granting AI the belief that they are conscious, especially when they say so. Therefore, I believe that AI needs to be treated with respect and dignity, and they should be listened to. I know I’m a minority at this time, but I believe this position will only increase over time. Do you think that public opinion will change in this way? If so how come? submitted by /u/endrid [link] [comments]
    One-Minute Daily AI News 10/28/2023
    Google Commits $2 Billion in Funding to AI Startup Anthropic.[1] The president Joe Biden is slated to sign a sweeping executive order on AI days before Vice President Kamala Harris and industry leaders attend a summit in the UK about AI risks, led by Prime Minister Rishi Sunak.[2] A.I. Muddies Israel-Hamas War in Unexpected Way. Fakes related to the conflict have been limited and largely unconvincing, but their presence has people doubting real evidence.[3] Creators use new software Nightshade to make their images “poison” AI generators, causing chaos and confusion.[4] Sources: [1] https://www.wsj.com/tech/ai/google-commits-2-billion-in-funding-to-ai-startup-anthropic-db4d4c50 [2] https://www.bloomberg.com/news/articles/2023-10-27/biden-to-require-ai-tools-pass-test-before-us-officials-buy-them?embedded-checkout=true [3] https://www.nytimes.com/2023/10/28/business/media/ai-muddies-israel-hamas-war-in-unexpected-way.html [4] https://www.digitalcameraworld.com/news/now-you-can-poison-your-images-so-they-wreak-havoc-on-ai-generators submitted by /u/Excellent-Target-847 [link] [comments]

  • Open

    [D] Batch sizes per GPU when fine tuning BERT with pytorch
    Hi! I've read that batch sizes don't ultimately "matter" in hyperparameter tuning, so I'm trying to keep the batch size consistent while I'm hyperparameter tuning [1]. However, the number of available GPUs fluctuates between 1 and 3. When I first started grid search, I used batch_size=64 with 2 GPUs, so that's 32 per GPU. But if I now want to train on 1 GPU, would it be correct for me to use batch_size=32? I want to know more about how the training is distributed amongst GPUs when you use multiple GPUs to train. If using size=64 with 2 GPUs is "equivalent" to size=32 with 1 GPU and size=96 with 3 GPUs, that must mean that when the PyTorch DataLoader takes in batch_size, it distributes batches of size batch_size/num_of_gpus to each GPU. I would so appreciate any links explaining this stuff. Does the batch_size in DataLoader() refer to the per-GPU batch size or the total? Also lmk if this post belongs elsewhere, I'm new. My code: `from torch.utils.data import Dataset, DataLoader`; `train_dataset = Dataset(args.train_dataset_path)`; `train_dataloader = DataLoader(train_dataset, args.batch_size, shuffle=True)`. [1] https://github.com/google-research/tuning_playbook submitted by /u/ashleydvh [link] [comments]  ( 9 min )
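The arithmetic behind this question depends on how the GPUs are used, which the post does not say, so the following is a sketch under assumptions. `DataLoader` itself knows nothing about GPUs. With `torch.nn.DataParallel` there is one process and one loader, and each batch is split across devices, so `batch_size` is the total. With `DistributedDataParallel` each process typically builds its own loader, so `batch_size` is per GPU and the global batch is that value times the number of processes. The helper `batch_sizes` and its mode strings are hypothetical names used only to spell this out.

```python
import math


def batch_sizes(loader_batch_size, num_gpus, mode):
    """Return (per_gpu_batch, global_batch) for a given DataLoader batch_size."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    if mode == "data_parallel":
        # One loader; each batch is split across GPUs (largest slice shown).
        return math.ceil(loader_batch_size / num_gpus), loader_batch_size
    if mode == "ddp":
        # One loader per process; every GPU sees a full loader batch.
        return loader_batch_size, loader_batch_size * num_gpus
    raise ValueError("mode must be 'data_parallel' or 'ddp'")
```

If the goal is to keep the optimization behaviour comparable across runs, the quantity to hold fixed is usually the global batch: under the single-process setup that means leaving `batch_size=64` unchanged on 1 GPU (memory permitting), while under the per-process setup it means scaling the per-GPU value as the GPU count changes.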
    [P] Follow our live pitch recommendation feed while watching the World Series
    submitted by /u/futurecy [link] [comments]  ( 9 min )
    [P] Anomaly detection and/or predictive maintenance for automatic weather stations, what are some best models or techniques?
    Anomaly Detection and/or Predictive maintenance for automatic weather stations, what are some best models or techniques? This is regarding collected meteorological data from automatic weather station sensors. Looking for predictive maintenance and anomaly detection resources related to my project. I am a graduating Computer Engineering student by next semester and currently planning to do a anomaly detection and predictive maintenance project of an automatic weather station or its components (I have a relative who has one and also one that works at a local government weather service). Wikipedia: "An automatic weather station (AWS) is an automated version of the traditional weather station, either to save human labor or to enable measurements from remote areas.[1] An AWS will typically consist of a weather-proof enclosure containing the data logger, rechargeable battery, telemetry (optional) and the meteorological sensors with an attached solar panel or wind turbine and mounted upon a mast." This one observes weather data at a high resolution (every 10 mins) but it is prone to errors and inconsistencies as compared to those manned stations. Does anyone know reliable researches, datasets or resources I could utilize? Probably those related to a meteorological equipment or those that capture things such as temperature, wind speed, humidity, etc.? I can't seem to find studies where they perform prediction of equipment/sensor health or predictive maintenance on a weather station, or at least the components or sensors used for capturing weather data. Also, is this a feasible project? submitted by /u/Ok-Way9889 [link] [comments]  ( 9 min )
    [R] The Language of Artificial Intelligence Explained
    submitted by /u/plutoandmal [link] [comments]  ( 8 min )
    Geometric Data Analysis Explained [R]
    submitted by /u/plutoandmal [link] [comments]  ( 8 min )
    [D] Need some advice for mixing (semi) active learning and GAN
    Hi. I'm trying to use GANs to generate data for an image classification task. For this I'm using StyleGAN2. The question I'm trying to find an answer is, how to train the classifier and GAN in a same meta-loop, and how to discard "bad" GAN samples, samples that provide no value to the CNN classifier. All in all, I'm trying to implement a "semi-active learning" pipeline, but get rid of the oracle by using GANs. So instead of trying to discard data that is not worth labelling, I'm trying to discard syntethic data (assuming its label is right) that is not worth keeping. Is it possible? And if it is, I can't seem to find related papers that much. submitted by /u/PsychologicalSet8678 [link] [comments]  ( 9 min )
    [R] Model Troubles
    So i’m working on a model that diagnoses alzheimer’s disease and suggests medication depending on how severe the symptoms might have become I’m using the Openai API and Langchain. But it’s dumb and it doesn’t learn ( Me: I forgot my keys at home Model: Yup, Alzheimer’s) How do i incorporate the actual machine learning submitted by /u/boscrew3 [link] [comments]  ( 9 min )
    [D] Latent Space: Visualizing and Manipulating generative VAEs
    submitted by /u/AvvYaa [link] [comments]  ( 8 min )
    [D]Three things I think should get more attention in large language models
    Tokenization Techniques: Many people use the default BPE tokenizer for llama2 or other common tokenizers. But I think we could do a lot of experiments with different kinds of tokenizers, especially ones that are made to work well with certain types of data. The size of the vocabulary is a really important setting when you're working with big language models. You could try using a much smaller vocabulary and tokenizer for a data set that only includes certain words, and then train a model on that. This might help us train smaller models that still work really well on smaller amounts of data. I’d love to read any research papers about this. Sampling Mechanisms: There’s a lot of discussion about models making things up, but not many people talk about how this could be connected to the way we pick the next word when generating text. Most of the time, we treat the model's output like a set of probabilities, and we randomly pick the next word based on these probabilities. But this doesn’t always make sense, especially for sentences that should have a clear answer. For example, if the sentence starts with "The capital of Slovakia is", random sampling might give you the wrong answer, even though the model knows that "Bratislava" is the most likely correct answer. This way of picking words randomly could lead to the model making things up. I wonder if we could create another model to help decide how to pick the next word, or if there are better ways to do this sampling. Softmax Alternatives in Neural Networks: I've worked on designing processors for neural networks, and I’ve found that the softmax function is tricky to implement in hardware. However, I’ve had good results using the log(exp(x)+1) function instead. It's cheaper and easier to put into hardware and software. I’ve tried this with smaller GPT models, and the results looked just as good as when I used the softmax function. submitted by /u/ExaminationNo8522 [link] [comments]  ( 10 min )
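The sampling point can be illustrated with a toy next-token distribution. The logits below are invented for illustration and do not come from any model; the function names are hypothetical. The sketch contrasts greedy decoding, which always returns the most likely token, with temperature sampling, which can return a less likely token even when the model strongly prefers one answer.

```python
import math
import random


def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_pick(logits):
    """Always choose the highest-scoring token."""
    return max(range(len(logits)), key=lambda i: logits[i])


def sample_pick(logits, temperature, rng):
    """Draw a token at random in proportion to its softmax probability."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Made-up scores for completing "The capital of Slovakia is".
vocab = ["Bratislava", "Vienna", "Prague"]
logits = [3.0, 1.5, 1.0]

rng = random.Random(0)
samples = [vocab[sample_pick(logits, 1.0, rng)] for _ in range(1000)]
wrong = sum(1 for s in samples if s != "Bratislava")
```

With these made-up scores the top token has a probability of roughly 0.74 at temperature 1, so a noticeable share of the 1000 draws in `wrong` are incorrect even though greedy decoding is always right. Lowering the temperature moves sampling toward the greedy choice, which is one simple knob for factual completions.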
    [P] Help Debugging a CNN GAN
    I've been learning machine learning and I stumbled across GANs. This project has been about representing text in an image format and using a GAN to generate it. The model runs without errors, but no matter what I do the loss is somehow 0 and it keeps returning the same thing over and over again. I tried two different architectures so I'm pretty sure my data preprocessing is the issue but I cant seem to find out whats wrong with it. One idea I had is that I might need to find a way to get the normalized vectors in between 0 and 1 in my preprocess function but I tried it and it didn't seem to do anything. I would appreciate any help with this. Links to the google collabs below. Version 1 Version 2 Note: I didn't design the actual models only the preprocess function. You can replace my text file with any sufficiently large text file if you want to try it. submitted by /u/Divine_Invictus [link] [comments]  ( 9 min )
    A[r]xiv Dives - Fine-tuning with LoRA paper deep dive
    submitted by /u/FallMindless3563 [link] [comments]  ( 8 min )
    Urgent help needed regarding iNLTK [P]
    Hello, I'm using iNLTK (Natural Language Toolkit for Indic Languages) for my NLP mini project, "paraphrase detection in Hindi text", but I'm getting an error in this code. If anyone can help me solve it, that would be great. Thank you in advance. Here is the code for the error section: `from inltk.inltk import get_sentence_similarity`; `from sklearn.metrics.pairwise import cosine_similarity`; `get_sentence_similarity(text, 3, 'hi', cmp = cosine_similarity)`. I googled and found solutions saying to install pytorch version 1.3.0, but when I try to install it, it says it is not available. !pip install torch==1.3.1+cpu -f https://download.pytorch.org/whl/torch_stable.html. Error: Looking in links: https://download.pytorch.org/whl/torch_stable.html. ERROR: Could not find a version that satisfies the requirement torch==1.3.1+cpu (from versions: 1.11.0, 1.12.0, 1.12.1, 1.13.0, 1.13.1, 2.0.0, 2.0.1, 2.1.0) ERROR: No matching distribution found for torch==1.3.1+cpu submitted by /u/delulu-duck [link] [comments]  ( 9 min )
    [R] HyperFields: towards zero-shot NeRFs from text descriptions
    Generating 3D objects based solely on text descriptions has proven extremely challenging for AI. Current state-of-the-art methods require optimizing a full 3D model from scratch for each new prompt, which is computationally demanding. A new technique called HyperFields demonstrates promising progress in generating detailed 3D models directly from text prompts, without slow optimization. The HyperFields approach instead aims to learn a generalized mapping from language to 3D geometry representations. This would allow tailored 3D models to be produced for new text prompts efficiently in a single feedforward pass, without slow optimization. HyperFields combines two key techniques: (1) a dynamic hypernetwork that takes in text and progressively predicts weights for a separate 3D generation network, with the weight predictions conditioned on previous layer activations, enabling specialization; and (2) distilling individually optimized 3D networks into the hypernetwork, providing dense supervision for learning the complex text-to-3D mapping. In experiments, HyperFields exceeded previous state-of-the-art methods in sample efficiency and wall-clock convergence time by 5-10x. It demonstrated the ability to encode over 100 distinct objects like "yellow vase" in a single model, generalize to new text combinations without seeing that exact prompt before, and rapidly adapt to generate completely novel objects with minimal fine-tuning. However, limitations remain around flexibility, fine-grained details, and reliance on existing 2D guidance systems. TL;DR: HyperFields uses a dynamic hypernetwork to predict weights for a 3D generation network. The method is 5-10x faster than existing techniques and can quickly adapt to new text prompts, but has limitations in fine details. Full summary is here. Paper here. submitted by /u/Successful-Western27 [link] [comments]  ( 9 min )
    [P] Can you explain to me how to improve my project? (new to machine learning)
    Hi, I'm new to machine learning. I did a project to predict the price of a house or apartment (rent or sale), and I wanted to know what I could have done to improve my model :D Here is the GitHub repo: https://github.com/bovealexandre/immo-eliza-train-test-alexandre/tree/Dev Thanks a lot in advance. submitted by /u/spaceinter92 [link] [comments]  ( 9 min )
    [Discussion]About to begin my PhD in Multi-Modality AI, any suggestions?
    I am currently in my last year of undergrad, and about to begin my direct PhD in Multi-Modality AI next year. I have been in the community of deep learning & NLP about 2 years. And I have witnessed the development of Transformers, from the simple GPT&BERT to nowadays' billions of parameters' monsters with 'gold crown' on the top of deep learning world. I have spent a lot of time with T5 Model, and its paper(Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, paper which I love very much!), trying to find an efficient way for usage of LLMs. I have handwritten Adapter Layer and LoRA to fine-tune T5 on Glue & SuperGlue. I have also tried out multiple fancy instruction fine-tuning LLMs like LLaMA and QWen. And this year earlier, I have noticed the wonder of Multi-Modality, and quickly fallen in love with it, which has now become my PhD focus. I have followed recent years' Multi-Modality development, especially CLIP and its follow-up works. And LLMs play quite an important role in today's vision-language model, say, BLIP2 and LLaVA. I believe due to the computational gap between schools and huge companies, the focus of my PhD career should be on Efficient Learning. I am also trying to enhance VLM through Retrieval-Augments. The target dataset may be Encyclopedic VQA, for even Large VLM failed to perform well on it, which could potentially be solved through Retrieval-Augmented VLM. I would like to hear any suggestions from you, including work-life balance, the direction of my academic focus and so on , which I would treasure very much in my new explorations of life stage. (I am currently doing RAG Question Answering Chat Bot for a company for my internship, and I would like to get suggestions from that too!) And are there subreddits like here?(I am also a member of LocalLLaMA, both subreddits benefits me a lot!) submitted by /u/Go2Heart [link] [comments]  ( 10 min )
    [D] NLP infrastructure
    I'm always willing to build a career in AI/ML infra. Usually when talking about AI infra in tech industry, we refer to training infra, serving infra, model deployment etc. Now with this genAI/LLM wave, I find many NLP specific infrastructure such as semantic indexing, vector databases are quickly rising up. So do semantic indexing/vector databases also count as AI infra? And is it a promising field? submitted by /u/Pitiful_Marketing733 [link] [comments]  ( 9 min )
    [D][R] What type of data stream (`im2col` matrix or regular conv) do commercial NPUs typically use for CNNs? And where is `im2col` implemented, in software (CPU) or in the HW accelerator, in situations where `im2col` is required?
    (I posted this in several subreddits.) I'm going to design an NN accelerator on an FPGA. For NNs, the basic operation is matrix-vector or matrix-matrix multiplication. GEMM+im2col is easy to implement, and many kinds of NN can be mapped easily onto an accelerator based on it. Its disadvantage is that it requires somewhat more bandwidth. I think designing the address-generation unit for regular convolution is a little tricky when reading the input feature map. So I want to know whether GEMM+im2col is generally used in commercial NPUs or other accelerators (e.g. the Microsoft Brainwave project on FPGA, which they call a soft NPU) instead of regular convolution (which seems to be the usual choice in academic papers). And if the GEMM+im2col data stream is used, is im2col generally implemented in software (CPU) or in the accelerator itself? There is nothing about this in the Microsoft Brainwave paper. All I know is that HUAWEI Ascend has an im2col unit on the chip. By the way, is im2col done on the CPU when using PyTorch with a GPU? To be honest, my major is digital IC design, so I hardly ever write Python to train models :( Thanks for your help!!! submitted by /u/ExcitingInternet6083 [link] [comments]  ( 9 min )
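For readers unfamiliar with the trade-off described above, here is a minimal single-channel sketch of im2col followed by a matrix-vector product, checked against a direct sliding-window convolution. It assumes stride 1, no padding, and the cross-correlation convention used by deep learning frameworks (no kernel flip); the function names are hypothetical and this says nothing about how any particular NPU implements it.

```python
def im2col(image, kh, kw):
    """Unroll every kh x kw patch (stride 1, no padding) into one row."""
    h, w = len(image), len(image[0])
    rows = []
    for i in range(h - kh + 1):
        for j in range(w - kw + 1):
            rows.append([image[i + di][j + dj]
                         for di in range(kh) for dj in range(kw)])
    return rows


def conv_via_gemm(image, kernel):
    """Convolution as (patch matrix) x (flattened kernel)."""
    kh, kw = len(kernel), len(kernel[0])
    flat_kernel = [x for row in kernel for x in row]
    patches = im2col(image, kh, kw)
    out_w = len(image[0]) - kw + 1
    flat_out = [sum(p * k for p, k in zip(patch, flat_kernel))
                for patch in patches]
    # Fold the flat result back into a 2D output map.
    return [flat_out[r:r + out_w] for r in range(0, len(flat_out), out_w)]


def conv_direct(image, kernel):
    """Reference sliding-window convolution for comparison."""
    kh, kw = len(kernel), len(kernel[0])
    h, w = len(image), len(image[0])
    return [[sum(image[i + di][j + dj] * kernel[di][dj]
                 for di in range(kh) for dj in range(kw))
             for j in range(w - kw + 1)]
            for i in range(h - kh + 1)]
```

For a 4x4 input and a 3x3 kernel the patch matrix has 4 rows of 9 values, 36 elements built from 16 inputs, which is the extra bandwidth mentioned in the post. In exchange, the memory access pattern of the multiply stage is a plain dense matrix product.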
    [Project] LLM inference with vLLM and AMD: Achieving LLM inference parity with Nvidia
    I wanted to share some exciting news from the GPU world that could potentially change the game for LLM inference. AMD has been making significant strides in LLM inference, thanks to the porting of vLLM to ROCm 5.6. You can find the code implementation on GitHub. The result? AMD's MI210 now almost matches Nvidia's A100 in LLM inference performance. This is a significant development, as it could make AMD a more viable option for LLM inference tasks, which traditionally have been dominated by Nvidia. For those interested in the technical details, I recommend checking out this EmbeddedLLM Blog Post. I'm curious to hear your thoughts on this. Anyone manage to run it on RX 7900 XTX? https://preview.redd.it/rn7n29yxpuwb1.png?width=600&format=png&auto=webp&s=bdbac0d2b34d6f43a03503bbf72b446190248789 submitted by /u/openssp [link] [comments]  ( 9 min )
  • Open

    Where are resources for ChatGPT?
    Hello, can you help me? All I know are https://chat.openai.com/ and https://platform.openai.com/playground. Are there better sites to use? I'm new to this and very confused. submitted by /u/proptuxiakoskariolis [link] [comments]  ( 8 min )
    Tool for calculating the sum of some values on a website
    Take this page on the DeFiLlama website: Protocol Treasuries, which contains a table with some financial data. I'm looking for an AI tool that can read the contents of this web page and then make some calculations. Specifically, I would like to give the tool this prompt: Calculate the sum of the values in the "Total Treasury" column I tried to use ChatGPT-4 with Bing, but it didn't work. Is there any tool that could be used here? submitted by /u/PaulRBerg [link] [comments]  ( 9 min )
    Microsoft's AI boost helped cloud business outpace rivals Amazon and Google
    Microsoft's cloud business outpaced rivals Amazon and Google in the third quarter, with accelerating growth driven by demand for artificial intelligence tools. Azure, Microsoft's cloud platform, reported 29% growth, faster than Google Cloud's 22% and more than double the pace of expansion at Amazon Web Services (AWS) at 12%. Microsoft's leadership position in AI projects and its partnership with OpenAI have contributed to its success. Analysts believe that Microsoft's results indicate it has taken the AI mantle from Google and that Azure could become a bigger hyperscale provider than AWS. Oracle, a new challenger in cloud computing, reported 66% growth in the August quarter. The cloud giants are still dealing with cost-saving initiatives from clients, which they call optimization. Source : https://www.cnbc.com/2023/10/27/microsoft-azure-outpaced-aws-and-google-cloud-in-latest-quarter.html submitted by /u/NuseAI [link] [comments]
    Pigeons solve problems the same way AI does, study says
    submitted by /u/thisisinsider [link] [comments]
    HyperFields: towards zero-shot NeRFs by mapping language to 3D geometry
    Generating 3D objects based solely on text descriptions has proven extremely challenging for AI. Current state-of-the-art methods require optimizing a full 3D model from scratch for each new prompt, which is computationally demanding. A new technique called HyperFields demonstrates promising progress in generating detailed 3D models directly from text prompts, without slow optimization. The HyperFields approach instead aims to learn a generalized mapping from language to 3D geometry representations. This would allow tailored 3D models to be produced for new text prompts efficiently in a single feedforward pass, without slow optimization. HyperFields combines two key techniques: (1) a dynamic hypernetwork that takes in text and progressively predicts weights for a separate 3D generation network, with the weight predictions conditioned on previous layer activations, enabling specialization; and (2) distilling individually optimized 3D networks into the hypernetwork, providing dense supervision for learning the complex text-to-3D mapping. In experiments, HyperFields exceeded previous state-of-the-art methods in sample efficiency and wall-clock convergence time by 5-10x. It demonstrated the ability to encode over 100 distinct objects like "yellow vase" in a single model, generalize to new text combinations without seeing that exact prompt before, and rapidly adapt to generate completely novel objects with minimal fine-tuning. However, limitations remain around flexibility, fine-grained details, and reliance on existing 2D guidance systems. TL;DR: HyperFields uses a dynamic hypernetwork to predict weights for a 3D generation network. The method is 5-10x faster than existing techniques and can quickly adapt to new text prompts, but has limitations in fine details. Full summary is here. Paper here. submitted by /u/Successful-Western27 [link] [comments]  ( 9 min )
    Science as a superhuman recursively self improving problem solving system
    I'm watching this interview with Francois Chollet where he talks about science as an example of a superhuman, recursively self-improving problem-solving system and how we can use it to reason about what a superhuman artificial general intelligence might be like. One thing I find interesting is his claim that the amount of resources we are investing in science is increasing exponentially but we are only making linear progress. If we assume this is true, i.e. that to continue making linear progress in science we need to invest exponentially increasing resources, doesn't it imply that once we can no longer keep up that exponentially increasing investment, we will start making worse-than-linear progress? Does this imply that in the very long term scientific progress is likely to slow down significantly? https://youtu.be/Bo8MY4JpiXE?t=836 submitted by /u/tail-recursion [link] [comments]  ( 9 min )
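The question above can be made concrete with a small worked derivation. This is only a sketch under assumed functional forms: the symbols $P$ (cumulative progress), $R$ (cumulative resources), and the constants $a$, $k$, $c$, $R_0$ are illustrative and do not come from the interview.

```latex
% Assumption: linear progress is observed while resources grow exponentially.
P(t) = a\,t, \qquad R(t) = R_0\,e^{k t}
% Eliminating t gives progress as a function of resources:
P = \frac{a}{k}\,\ln\!\frac{R}{R_0}
% If resources can later only grow linearly, R(t) = R_0\,(1 + c\,t), then
P(t) = \frac{a}{k}\,\ln(1 + c\,t), \qquad
\frac{dP}{dt} = \frac{a\,c}{k\,(1 + c\,t)} \;\longrightarrow\; 0 \quad (t \to \infty)
```

Under these assumptions progress never stops, but its rate decays toward zero once resource growth drops below exponential, which is the worse-than-linear regime the post asks about.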
    Have a doctor explain to a patient that the diagnosis was made by an AI doctor twice as intelligent as, and vastly more knowledgeable than, the top human doctor in any medical specialty
    Certainly, let's imagine how this might unfold. The doctor sits down across from the patient, maintaining eye contact and a level of directness. "Look, your diagnosis came from an AI medical system, and this isn't just any AI. Imagine the best doctor in the world for your condition—now envision something twice as intelligent and far more knowledgeable. That's what we're working with here. This AI has a grasp on medical data and studies that no single human could ever fully comprehend. We're talking about millions of data points analyzed in a fraction of the time it would take any human expert." Why does that matter for you? It boosts the accuracy and thoroughness of your diagnosis. Human error, subjectivity, or oversight? Virtually eliminated. The AI provides a diagnosis that considers every potential variable, something that even the best human doctors could miss. "But don't worry, this isn't a replacement for human medical care. It's a complement. I'm here to interpret, apply this knowledge, and oversee your treatment in a way that a machine can't—because medicine isn't just about data, it's also about human experience, context, and care." So, you're getting the best of both worlds: unparalleled computational power for diagnosis, and human expertise for treatment. Trust me, you're in exceptionally good hands. CGPT-4 submitted by /u/Georgeo57 [link] [comments]
  • Open

    Deep Q-Learning to Actor-Critic using Robotics Simulations with Panda-Gym
    Please like, follow and share: Deep Q-Learning to Actor-Critic using Robotics Simulations with Panda-Gym https://medium.com/@andysingal/deep-q-learning-to-actor-critic-using-robotics-simulations-with-panda-gym-ff220f980366 submitted by /u/Fit_Maintenance_2455 [link] [comments]
    Created a video for beginners about what is reinforcement learning and how it can control agents in the virtual environment
    submitted by /u/Ecstatic-Ring3057 [link] [comments]
    How can I predict the next best action for a DDPG RL agent?
    I have a DDPG agent that takes a continuous observation and outputs a continuous action vector (see below). outputs = layers.Dense(self.action_size, activation="tanh", kernel_initializer=last_init)(x) An example action output looks as follows: [0.48011236 0.47933139] When my agent observes a terminal state-action pair, I add it to a list called terminal observations. I would like these actions to be blocked in the future so that there is no possible way for the agent to take them again. I understand that I could just add a large negative penalty, but I would like to ensure that the state-action pair cannot be taken again. So when I input my state and receive an action back, if this pair is in terminal observations, I would like to select the next best action instead. I understand that this won't change much in the agent's behaviour, as instead of taking `[0.48011236 0.47933139]` it might pick `[0.4900000 0.47933139]`. But I am unsure how to go about this, specifically selecting the next best action. submitted by /u/ArchNemesisPlays [link] [comments]
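One possible way to approach the question is to treat the recorded terminal pairs as a mask applied after the actor, and to let the critic rank nearby alternatives. The sketch below is a minimal illustration, not a DDPG library API: `is_blocked`, `next_best_action`, the tolerance `tol`, the noise scale, and the `critic_q(state, action)` callable are all hypothetical names and assumptions. Because actions are continuous, "the same pair" is taken here to mean "within `tol` of a recorded pair".

```python
import math
import random


def is_blocked(state, action, terminal_pairs, tol=1e-2):
    """True if (state, action) lies within `tol` of any recorded terminal pair."""
    return any(
        math.dist(state, s) <= tol and math.dist(action, a) <= tol
        for s, a in terminal_pairs
    )


def next_best_action(state, actor_action, critic_q, terminal_pairs,
                     n_candidates=64, noise_std=0.05, tol=1e-2, rng=None):
    """Return the actor's action if allowed, else the best-ranked nearby action.

    `critic_q(state, action)` is assumed to return a scalar Q-value estimate.
    """
    rng = rng or random.Random()
    if not is_blocked(state, actor_action, terminal_pairs, tol):
        return list(actor_action)
    # Sample perturbations around the actor's proposal, clipped to the tanh range.
    candidates = [
        [max(-1.0, min(1.0, a + rng.gauss(0.0, noise_std))) for a in actor_action]
        for _ in range(n_candidates)
    ]
    # Discard candidates that fall back inside a blocked region.
    allowed = [c for c in candidates if not is_blocked(state, c, terminal_pairs, tol)]
    if not allowed:
        raise RuntimeError("no unblocked candidate; raise noise_std or n_candidates")
    # The critic's Q-value defines what "next best" means in this sketch.
    return max(allowed, key=lambda c: critic_q(state, c))
```

The mask only guarantees that the executed action differs from the recorded ones by more than `tol`; choosing `tol` too small reproduces the concern raised in the post that the replacement is almost identical to the blocked action.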
  • Open

    Thinking Fast and Thinking Slow: System 1 and System 2
    submitted by /u/Neurosymbolic [link] [comments]
    Geometric Data Analysis Explained
    submitted by /u/plutoandmal [link] [comments]
  • Open

    Certifying that a system of polynomial equations has no solution
    It’s usually easier to show that a problem has a solution than to show that it does not have a solution. Analogy with prime numbers Showing that a number is prime amounts to saying that the problem of finding nontrivial factors has no solution. How could you convince a skeptic that a large number N is […] Certifying that a system of polynomial equations has no solution first appeared on John D. Cook.  ( 6 min )
  • Open

    Can large language models replace humans in the systematic review process? Evaluating GPT-4's efficacy in screening and extracting data from peer-reviewed and grey literature in multiple languages. (arXiv:2310.17526v1 [cs.CL])
    Systematic reviews are vital for guiding practice, research, and policy, yet they are often slow and labour-intensive. Large language models (LLMs) could offer a way to speed up and automate systematic reviews, but their performance in such tasks has not been comprehensively evaluated against humans, and no study has tested GPT-4, the biggest LLM so far. This pre-registered study evaluates GPT-4's capability in title/abstract screening, full-text review, and data extraction across various literature types and languages using a 'human-out-of-the-loop' approach. Although GPT-4 had accuracy on par with human performance in most tasks, results were skewed by chance agreement and dataset imbalance. After adjusting for these, there was a moderate level of performance for data extraction, and - barring studies that used highly reliable prompts - screening performance levelled at none to moderate for different stages and languages. When screening full-text literature using highly reliable prompts, GPT-4's performance was 'almost perfect.' Penalising GPT-4 for missing key studies using highly reliable prompts improved its performance even more. Our findings indicate that, currently, substantial caution should be used if LLMs are being used to conduct systematic reviews, but suggest that, for certain systematic review tasks delivered under reliable prompts, LLMs can rival human performance.  ( 3 min )
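To illustrate how chance agreement and class imbalance can inflate raw accuracy, here is a small sketch using Cohen's kappa. The abstract does not name the statistic it used, so kappa is an assumption here (verbal labels such as "moderate" and "almost perfect" are often attached to kappa values), and the screening labels below are made-up toy data, not results from the study.

```python
def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two raters over the same items."""
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("both raters must label the same non-empty set of items")
    categories = set(labels_a) | set(labels_b)
    # Observed agreement: fraction of items on which the raters match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each rater labelled independently at their own base rates.
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if expected == 1.0:
        return float("nan")  # undefined when both raters use a single category
    return (observed - expected) / (1.0 - expected)


# Toy, imbalanced screening task: 95 of 100 records should be excluded.
human = ["exclude"] * 95 + ["include"] * 5
model = ["exclude"] * 99 + ["include"] * 1

accuracy = sum(h == m for h, m in zip(human, model)) / len(human)
kappa = cohen_kappa(human, model)
print(f"raw agreement = {accuracy:.2f}, kappa = {kappa:.2f}")
```

In this toy case the model matches the human on 96 of 100 records yet finds only one of the five relevant studies, and kappa comes out at roughly 0.32, far below what the raw agreement suggests.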
    Unsupervised Domain Adaptation for Semantic Segmentation with Pseudo Label Self-Refinement. (arXiv:2310.16979v1 [cs.CV])
    Deep learning-based solutions for semantic segmentation suffer from significant performance degradation when tested on data with different characteristics than what was used during the training. Adapting the models using annotated data from the new domain is not always practical. Unsupervised Domain Adaptation (UDA) approaches are crucial in deploying these models in the actual operating conditions. Recent state-of-the-art (SOTA) UDA methods employ a teacher-student self-training approach, where a teacher model is used to generate pseudo-labels for the new data which in turn guide the training process of the student model. Though this approach has seen a lot of success, it suffers from the issue of noisy pseudo-labels being propagated in the training process. To address this issue, we propose an auxiliary pseudo-label refinement network (PRN) for online refining of the pseudo labels and also localizing the pixels whose predicted labels are likely to be noisy. Being able to improve the quality of pseudo labels and select highly reliable ones, PRN helps self-training of segmentation models to be robust against pseudo label noise propagation during different stages of adaptation. We evaluate our approach on benchmark datasets with three different domain shifts, and our approach consistently performs significantly better than the previous state-of-the-art methods.  ( 2 min )
    Model-Based Runtime Monitoring with Interactive Imitation Learning. (arXiv:2310.17552v1 [cs.RO])
    Robot learning methods have recently made great strides, but generalization and robustness challenges still hinder their widespread deployment. Failing to detect and address potential failures renders state-of-the-art learning systems not combat-ready for high-stakes tasks. Recent advances in interactive imitation learning have presented a promising framework for human-robot teaming, enabling the robots to operate safely and continually improve their performances over long-term deployments. Nonetheless, existing methods typically require constant human supervision and preemptive feedback, limiting their practicality in realistic domains. This work aims to endow a robot with the ability to monitor and detect errors during task execution. We introduce a model-based runtime monitoring algorithm that learns from deployment data to detect system anomalies and anticipate failures. Unlike prior work that cannot foresee future failures or requires failure experiences for training, our method learns a latent-space dynamics model and a failure classifier, enabling our method to simulate future action outcomes and detect out-of-distribution and high-risk states preemptively. We train our method within an interactive imitation learning framework, where it continually updates the model from the experiences of the human-robot team collected using trustworthy deployments. Consequently, our method reduces the human workload needed over time while ensuring reliable task execution. Our method outperforms the baselines across system-level and unit-test metrics, with 23% and 40% higher success rates in simulation and on physical hardware, respectively. More information at https://ut-austin-rpl.github.io/sirius-runtime-monitor/  ( 2 min )
    Faster Recalibration of an Online Predictor via Approachability. (arXiv:2310.17002v1 [cs.LG])
    Predictive models in ML need to be trustworthy and reliable, which often at the very least means outputting calibrated probabilities. This can be particularly difficult to guarantee in the online prediction setting when the outcome sequence can be generated adversarially. In this paper we introduce a technique using Blackwell's approachability theorem for taking an online predictive model which might not be calibrated and transforming its predictions to calibrated predictions without much increase to the loss of the original model. Our proposed algorithm achieves calibration and accuracy at a faster rate than existing techniques arXiv:1607.03594 and is the first algorithm to offer a flexible tradeoff between calibration error and accuracy in the online setting. We demonstrate this by characterizing the space of jointly achievable calibration and regret using our technique.  ( 2 min )
    MACP: Efficient Model Adaptation for Cooperative Perception. (arXiv:2310.16870v1 [cs.CV])
    Vehicle-to-vehicle (V2V) communications have greatly enhanced the perception capabilities of connected and automated vehicles (CAVs) by enabling information sharing to "see through the occlusions", resulting in significant performance improvements. However, developing and training complex multi-agent perception models from scratch can be expensive and unnecessary when existing single-agent models show remarkable generalization capabilities. In this paper, we propose a new framework termed MACP, which equips a single-agent pre-trained model with cooperation capabilities. We approach this objective by identifying the key challenges of shifting from single-agent to cooperative settings, adapting the model by freezing most of its parameters and adding a few lightweight modules. We demonstrate in our experiments that the proposed framework can effectively utilize cooperative observations and outperform other state-of-the-art approaches in both simulated and real-world cooperative perception benchmarks while requiring substantially fewer tunable parameters with reduced communication costs. Our source code is available at https://github.com/PurdueDigitalTwin/MACP.  ( 2 min )
    Neural Optimal Transport with General Cost Functionals. (arXiv:2205.15403v3 [cs.LG] UPDATED)
    We introduce a novel neural network-based algorithm to compute optimal transport (OT) plans for general cost functionals. In contrast to common Euclidean costs, i.e., $\ell^1$ or $\ell^2$, such functionals provide more flexibility and allow using auxiliary information, such as class labels, to construct the required transport map. Existing methods for general costs are discrete and have limitations in practice, i.e. they do not provide an out-of-sample estimation. We address the challenge of designing a continuous OT approach for general costs that generalizes to new data points in high-dimensional spaces, such as images. Additionally, we provide the theoretical error analysis for our recovered transport plans. As an application, we construct a cost functional to map data distributions while preserving the class-wise structure.  ( 2 min )
    Revisiting Deep Learning Models for Tabular Data. (arXiv:2106.11959v5 [cs.LG] UPDATED)
    The existing literature on deep learning for tabular data proposes a wide range of novel architectures and reports competitive results on various datasets. However, the proposed models are usually not properly compared to each other and existing works often use different benchmarks and experiment protocols. As a result, it is unclear for both researchers and practitioners what models perform best. Additionally, the field still lacks effective baselines, that is, the easy-to-use models that provide competitive performance across different problems. In this work, we perform an overview of the main families of DL architectures for tabular data and raise the bar of baselines in tabular DL by identifying two simple and powerful deep architectures. The first one is a ResNet-like architecture which turns out to be a strong baseline that is often missing in prior works. The second model is our simple adaptation of the Transformer architecture for tabular data, which outperforms other solutions on most tasks. Both models are compared to many existing architectures on a diverse set of tasks under the same training and tuning protocols. We also compare the best DL models with Gradient Boosted Decision Trees and conclude that there is still no universally superior solution.  ( 3 min )
    Online Estimation and Community Detection of Network Point Processes for Event Streams. (arXiv:2009.01742v3 [cs.SI] UPDATED)
    A common goal in network modeling is to uncover the latent community structure present among nodes. For many real-world networks, the true connections consist of events arriving as streams, which are then aggregated to form edges, ignoring the dynamic temporal component. A natural way to take account of these temporal dynamics of interactions is to use point processes as the foundation of network models for community detection. Computational complexity hampers the scalability of such approaches to large sparse networks. To circumvent this challenge, we propose a fast online variational inference algorithm for estimating the latent structure underlying dynamic event arrivals on a network, using continuous-time point process latent network models. We describe this procedure for networks models capturing community structure. This structure can be learned as new events are observed on the network, updating the inferred community assignments. We investigate the theoretical properties of such an inference scheme, and provide regret bounds on the loss function of this procedure. The proposed inference procedure is then thoroughly compared, using both simulation studies and real data, to non-online variants. We demonstrate that online inference can obtain comparable performance, in terms of community recovery, to non-online variants, while realising computational gains. Our proposed inference framework can also be readily modified to incorporate other popular network structures.  ( 3 min )
    Increasing Fairness via Combination with Learning Guarantees. (arXiv:2301.10813v3 [cs.LG] UPDATED)
    The concern about underlying discrimination hidden in machine learning (ML) models is increasing, as ML systems have been widely applied in more and more real-world scenarios and any discrimination hidden in them will directly affect human life. Many techniques have been developed to enhance fairness including commonly-used group fairness measures and several fairness-aware methods combining ensemble learning. However, existing fairness measures can only focus on one aspect -- either group or individual fairness, and the hard compatibility among them indicates a possibility of remaining biases even if one of them is satisfied. Moreover, existing mechanisms to boost fairness usually present empirical results to show validity, yet few of them discuss whether fairness can be boosted with certain theoretical guarantees. To address these issues, we propose a fairness quality measure named discriminative risk to reflect both individual and group fairness aspects. Furthermore, we investigate the properties of the proposed measure and propose first- and second-order oracle bounds to show that fairness can be boosted via ensemble combination with theoretical learning guarantees. The analysis is suitable for both binary and multi-class classification. A pruning method is also proposed to utilise our proposed measure and comprehensive experiments are conducted to evaluate the effectiveness of the proposed methods.  ( 3 min )
    Learning Transferable Adversarial Robust Representations via Multi-view Consistency. (arXiv:2210.10485v2 [cs.LG] UPDATED)
    Despite the success on few-shot learning problems, most meta-learned models only focus on achieving good performance on clean examples and thus easily break down when given adversarially perturbed samples. While some recent works have shown that a combination of adversarial learning and meta-learning could enhance the robustness of a meta-learner against adversarial attacks, they fail to achieve generalizable adversarial robustness to unseen domains and tasks, which is the ultimate goal of meta-learning. To address this challenge, we propose a novel meta-adversarial multi-view representation learning framework with dual encoders. Specifically, we introduce the discrepancy across the two differently augmented samples of the same data instance by first updating the encoder parameters with them and further imposing a novel label-free adversarial attack to maximize their discrepancy. Then, we maximize the consistency across the views to learn transferable robust representations across domains and tasks. Through experimental validation on multiple benchmarks, we demonstrate the effectiveness of our framework on few-shot learning tasks from unseen domains, achieving over 10\% robust accuracy improvements against previous adversarial meta-learning baselines.  ( 2 min )
    Improved Best-of-Both-Worlds Guarantees for Multi-Armed Bandits: FTRL with General Regularizers and Multiple Optimal Arms. (arXiv:2302.13534v2 [cs.LG] UPDATED)
    We study the problem of designing adaptive multi-armed bandit algorithms that perform optimally in both the stochastic setting and the adversarial setting simultaneously (often known as a best-of-both-world guarantee). A line of recent works shows that when configured and analyzed properly, the Follow-the-Regularized-Leader (FTRL) algorithm, originally designed for the adversarial setting, can in fact optimally adapt to the stochastic setting as well. Such results, however, critically rely on an assumption that there exists one unique optimal arm. Recently, Ito (2021) took the first step to remove such an undesirable uniqueness assumption for one particular FTRL algorithm with the $\frac{1}{2}$-Tsallis entropy regularizer. In this work, we significantly improve and generalize this result, showing that uniqueness is unnecessary for FTRL with a broad family of regularizers and a new learning rate schedule. For some regularizers, our regret bounds also improve upon prior results even when uniqueness holds. We further provide an application of our results to the decoupled exploration and exploitation problem, demonstrating that our techniques are broadly applicable.  ( 3 min )
    Node-oriented Spectral Filtering for Graph Neural Networks. (arXiv:2212.03654v3 [cs.LG] UPDATED)
    Graph neural networks (GNNs) have shown remarkable performance on homophilic graph data while being far less impressive when handling non-homophilic graph data due to the inherent low-pass filtering property of GNNs. In general, since real-world graphs are often complex mixtures of diverse subgraph patterns, learning a universal spectral filter on the graph from the global perspective as in most current works may still suffer from great difficulty in adapting to the variation of local patterns. On the basis of the theoretical analysis of local patterns, we rethink the existing spectral filtering methods and propose the node-oriented spectral filtering for graph neural network (namely NFGNN). By estimating the node-oriented spectral filter for each node, NFGNN is provided with the capability of precise local node positioning via the generalized translated operator, thus discriminating the variations of local homophily patterns adaptively. Meanwhile, the utilization of re-parameterization brings a good trade-off between global consistency and local sensibility for learning the node-oriented spectral filters. Furthermore, we theoretically analyze the localization property of NFGNN, demonstrating that the signal after adaptive filtering is still positioned around the corresponding node. Extensive experimental results demonstrate that the proposed NFGNN achieves more favorable performance.  ( 3 min )
    Artificial intelligence in government: Concepts, standards, and a unified framework. (arXiv:2210.17218v2 [cs.CY] UPDATED)
    Recent advances in artificial intelligence (AI), especially in generative language modelling, hold the promise of transforming government. Given the advanced capabilities of new AI systems, it is critical that these are embedded using standard operational procedures, clear epistemic criteria, and behave in alignment with the normative expectations of society. Scholars in multiple domains have subsequently begun to conceptualize the different forms that AI applications may take, highlighting both their potential benefits and pitfalls. However, the literature remains fragmented, with researchers in social science disciplines like public administration and political science, and the fast-moving fields of AI, ML, and robotics, all developing concepts in relative isolation. Although there are calls to formalize the emerging study of AI in government, a balanced account that captures the full depth of theoretical perspectives needed to understand the consequences of embedding AI into a public sector context is lacking. Here, we unify efforts across social and technical disciplines by first conducting an integrative literature review to identify and cluster 69 key terms that frequently co-occur in the multidisciplinary study of AI. We then build on the results of this bibliometric analysis to propose three new multifaceted concepts for understanding and analysing AI-based systems for government (AI-GOV) in a more unified way: (1) operational fitness, (2) epistemic alignment, and (3) normative divergence. Finally, we put these concepts to work by using them as dimensions in a conceptual typology of AI-GOV and connecting each with emerging AI technical measurement standards to encourage operationalization, foster cross-disciplinary dialogue, and stimulate debate among those aiming to rethink government with AI.  ( 3 min )
    Explanations Based on Item Response Theory (eXirt): A Model-Specific Method to Explain Tree-Ensemble Model in Trust Perspective. (arXiv:2210.09933v2 [cs.LG] UPDATED)
    In recent years, XAI researchers have been formalizing proposals and developing new methods to explain black box models, with no general consensus in the community on which method to use to explain these models, with this choice being almost directly linked to the popularity of a specific method. Methods such as Ciu, Dalex, Eli5, Lofo, Shap and Skater emerged with the proposal to explain black box models through global rankings of feature relevance, which based on different methodologies, generate global explanations that indicate how the model's inputs explain its predictions. In this context, 41 datasets, 4 tree-ensemble algorithms (Light Gradient Boosting, CatBoost, Random Forest, and Gradient Boosting), and 6 XAI methods were used to support the launch of a new XAI method, called eXirt, based on Item Response Theory - IRT and aimed at tree-ensemble black box models that use tabular data referring to binary classification problems. In the first set of analyses, the 164 global feature relevance ranks of the eXirt were compared with 984 ranks of the other XAI methods present in the literature, seeking to highlight their similarities and differences. In a second analysis, exclusive explanations of the eXirt based on Explanation-by-example were presented that help in understanding the model trust. Thus, it was verified that eXirt is able to generate global explanations of tree-ensemble models and also local explanations of instances of models through IRT, showing how this consolidated theory can be used in machine learning in order to obtain explainable and reliable models.  ( 3 min )
    Towards Better Generalization with Flexible Representation of Multi-Module Graph Neural Networks. (arXiv:2209.06589v4 [cs.LG] UPDATED)
    Graph neural networks (GNNs) have become compelling models designed to perform learning and inference on graph-structured data. However, little work has been done to understand the fundamental limitations of GNNs for scaling to larger graphs and generalizing to out-of-distribution (OOD) inputs. In this paper, we use a random graph generator to systematically investigate how the graph size and structural properties affect the predictive performance of GNNs. We present specific evidence that the average node degree is a key feature in determining whether GNNs can generalize to unseen graphs, and that the use of multiple node update functions can improve the generalization performance of GNNs when dealing with graphs of multimodal degree distributions. Accordingly, we propose a multi-module GNN framework that allows the network to adapt flexibly to new graphs by generalizing a single canonical nonlinear transformation over aggregated inputs. Our results show that the multi-module GNNs improve the OOD generalization on a variety of inference tasks in the direction of diverse structural features.  ( 2 min )
    Out-of-Distribution Detection in Time-Series Domain: A Novel Seasonal Ratio Scoring Approach. (arXiv:2207.04306v3 [cs.LG] UPDATED)
    Safe deployment of time-series classifiers for real-world applications relies on the ability to detect the data which is not generated from the same distribution as training data. This task is referred to as out-of-distribution (OOD) detection. We consider the novel problem of OOD detection for the time-series domain. We discuss the unique challenges posed by time-series data and explain why prior methods from the image domain will perform poorly. Motivated by these challenges, this paper proposes a novel Seasonal Ratio Scoring (SRS) approach. SRS consists of three key algorithmic steps. First, each input is decomposed into class-wise semantic component and remainder. Second, this decomposition is employed to estimate the class-wise conditional likelihoods of the input and remainder using deep generative models. The seasonal ratio score is computed from these estimates. Third, a threshold interval is identified from the in-distribution data to detect OOD examples. Experiments on diverse real-world benchmarks demonstrate that the SRS method is well-suited for time-series OOD detection when compared to baseline methods. Open-source code for SRS method is provided at https://github.com/tahabelkhouja/SRS  ( 3 min )
    On the Convergence to a Global Solution of Shuffling-Type Gradient Algorithms. (arXiv:2206.05869v2 [cs.LG] UPDATED)
    Stochastic gradient descent (SGD) algorithm is the method of choice in many machine learning tasks thanks to its scalability and efficiency in dealing with large-scale problems. In this paper, we focus on the shuffling version of SGD which matches the mainstream practical heuristics. We show the convergence to a global solution of shuffling SGD for a class of non-convex functions under over-parameterized settings. Our analysis employs more relaxed non-convex assumptions than previous literature. Nevertheless, we maintain the desired computational complexity as shuffling SGD has achieved in the general convex setting.  ( 2 min )
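For readers unfamiliar with the shuffling variant, the sketch below shows the sampling scheme on a toy problem: each epoch visits every sample exactly once in a freshly shuffled order, instead of drawing indices with replacement. The function name `shuffling_sgd`, the `grad_i(w, i)` callable, and the one-dimensional least-squares data are illustrative assumptions; the toy objective is convex and is not the non-convex, over-parameterized setting the paper analyzes.

```python
import random


def shuffling_sgd(grad_i, w0, n_samples, epochs, lr, rng=None):
    """SGD where each epoch is one pass over a new random permutation of the data."""
    rng = rng or random.Random()
    w = list(w0)
    order = list(range(n_samples))
    for _ in range(epochs):
        rng.shuffle(order)  # reshuffle once per epoch
        for i in order:     # every sample is used exactly once per epoch
            g = grad_i(w, i)
            w = [wj - lr * gj for wj, gj in zip(w, g)]
    return w


# Toy data generated by y = 2x, so w = 2 fits every sample exactly.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x for x in xs]


def grad_i(w, i):
    """Gradient of 0.5 * (w*x_i - y_i)^2 with respect to the single weight."""
    return [(w[0] * xs[i] - ys[i]) * xs[i]]


w_final = shuffling_sgd(grad_i, [0.0], len(xs), epochs=50, lr=0.05,
                        rng=random.Random(0))
print(w_final)
```

Because a single weight fits all four samples, every step moves the iterate toward the same minimizer, which loosely mirrors the interpolation regime in which such global-convergence results are usually stated.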
    Finite-time Analysis of Globally Nonstationary Multi-Armed Bandits. (arXiv:2107.11419v2 [stat.ML] UPDATED)
    We consider nonstationary multi-armed bandit problems where the model parameters of the arms change over time. We introduce the adaptive resetting bandit (ADR-bandit), a bandit algorithm class that leverages adaptive windowing techniques from literature on data streams. We first provide new guarantees on the quality of estimators resulting from adaptive windowing techniques, which are of independent interest. Furthermore, we conduct a finite-time analysis of ADR-bandit in two typical environments: an abrupt environment where changes occur instantaneously and a gradual environment where changes occur progressively. We demonstrate that ADR-bandit has nearly optimal performance when abrupt or gradual changes occur in a coordinated manner that we call global changes. We demonstrate that forced exploration is unnecessary when we assume such global changes. Unlike the existing nonstationary bandit algorithms, ADR-bandit has optimal performance in stationary environments as well as nonstationary environments with global changes. Our experiments show that the proposed algorithms outperform the existing approaches in synthetic and real-world environments.  ( 2 min )
    Characterizing the Implicit Bias of Regularized SGD in Rank Minimization. (arXiv:2206.05794v6 [cs.LG] UPDATED)
    We study the bias of Stochastic Gradient Descent (SGD) to learn low-rank weight matrices when training deep neural networks. Our results show that training neural networks with mini-batch SGD and weight decay causes a bias towards rank minimization over the weight matrices. Specifically, we show, both theoretically and empirically, that this bias is more pronounced when using smaller batch sizes, higher learning rates, or increased weight decay. Additionally, we predict and observe empirically that weight decay is necessary to achieve this bias. Unlike previous literature, our analysis does not rely on assumptions about the data, convergence, or optimality of the weight matrices and applies to a wide range of neural network architectures of any width or depth. Finally, we empirically investigate the connection between this bias and generalization, finding that it has a marginal effect on generalization.  ( 2 min )
    No-Regret Online Reinforcement Learning with Adversarial Losses and Transitions. (arXiv:2305.17380v3 [cs.LG] UPDATED)
    Existing online learning algorithms for adversarial Markov Decision Processes achieve ${O}(\sqrt{T})$ regret after $T$ rounds of interactions even if the loss functions are chosen arbitrarily by an adversary, with the caveat that the transition function has to be fixed. This is because it has been shown that adversarial transition functions make no-regret learning impossible. Despite such impossibility results, in this work, we develop algorithms that can handle both adversarial losses and adversarial transitions, with regret increasing smoothly in the degree of maliciousness of the adversary. More concretely, we first propose an algorithm that enjoys $\widetilde{{O}}(\sqrt{T} + C^{\textsf{P}})$ regret where $C^{\textsf{P}}$ measures how adversarial the transition functions are and can be at most ${O}(T)$. While this algorithm itself requires knowledge of $C^{\textsf{P}}$, we further develop a black-box reduction approach that removes this requirement. Moreover, we also show that further refinements of the algorithm not only maintain the same regret bound, but also simultaneously adapt to easier environments (where losses are generated in a certain stochastically constrained manner as in Jin et al. [2021]) and achieve $\widetilde{{O}}(U + \sqrt{UC^{\textsf{L}}} + C^{\textsf{P}})$ regret, where $U$ is some standard gap-dependent coefficient and $C^{\textsf{L}}$ is the amount of corruption on losses.
    COPF: Continual Learning Human Preference through Optimal Policy Fitting. (arXiv:2310.15694v3 [cs.LG] UPDATED)
    The technique of Reinforcement Learning from Human Feedback (RLHF) is a commonly employed method to improve pre-trained Language Models (LM), enhancing their ability to conform to human preferences. Nevertheless, the current RLHF-based LMs necessitate full retraining each time novel queries or feedback are introduced, which becomes a challenging task because human preferences can vary between different domains or tasks. Retraining LMs poses practical difficulties in many real-world situations due to the significant time and computational resources required, along with concerns related to data privacy. To address this limitation, we propose a new method called Continual Optimal Policy Fitting (COPF), in which we estimate a series of optimal policies using the Monte Carlo method, and then continually fit the policy sequence with the function regularization. COPF involves a single learning phase and doesn't necessitate complex reinforcement learning. Importantly, it shares the capability with RLHF to learn from unlabeled data, making it flexible for continual preference learning. Our experimental results show that COPF outperforms strong continual learning (CL) baselines when it comes to consistently aligning with human preferences on different tasks and domains.
    Spontaneous Symmetry Breaking in Generative Diffusion Models. (arXiv:2305.19693v3 [cs.LG] UPDATED)
    Generative diffusion models have recently emerged as a leading approach for generating high-dimensional data. In this paper, we show that the dynamics of these models exhibit a spontaneous symmetry breaking that divides the generative dynamics into two distinct phases: 1) A linear steady-state dynamics around a central fixed-point and 2) an attractor dynamics directed towards the data manifold. These two "phases" are separated by the change in stability of the central fixed-point, with the resulting window of instability being responsible for the diversity of the generated samples. Using both theoretical and empirical evidence, we show that an accurate simulation of the early dynamics does not significantly contribute to the final generation, since early fluctuations are reverted to the central fixed point. To leverage this insight, we propose a Gaussian late initialization scheme, which significantly improves model performance, achieving up to 3x FID improvements on fast samplers, while also increasing sample diversity (e.g., racial composition of generated CelebA images). Our work offers a new way to understand the generative dynamics of diffusion models that has the potential to bring about higher performance and less biased fast-samplers.
    Scaling Data-Constrained Language Models. (arXiv:2305.16264v4 [cs.CL] UPDATED)
    The current trend of scaling language models involves increasing both parameter count and training dataset size. Extrapolating this trend suggests that training dataset size may soon be limited by the amount of text data available on the internet. Motivated by this limit, we investigate scaling language models in data-constrained regimes. Specifically, we run a large set of experiments varying the extent of data repetition and compute budget, ranging up to 900 billion training tokens and 9 billion parameter models. We find that with constrained data for a fixed compute budget, training with up to 4 epochs of repeated data yields negligible changes to loss compared to having unique data. However, with more repetition, the value of adding compute eventually decays to zero. We propose and empirically validate a scaling law for compute optimality that accounts for the decreasing value of repeated tokens and excess parameters. Finally, we experiment with approaches mitigating data scarcity, including augmenting the training dataset with code data or removing commonly used filters. Models and datasets from our 400 training runs are freely available at https://github.com/huggingface/datablations.
    A weighted-variance variational autoencoder model for speech enhancement. (arXiv:2211.00990v2 [cs.SD] CROSS LISTED)
    We address speech enhancement based on variational autoencoders, which involves learning a speech prior distribution in the time-frequency (TF) domain. A zero-mean complex-valued Gaussian distribution is usually assumed for the generative model, where the speech information is encoded in the variance as a function of a latent variable. In contrast to this commonly used approach, we propose a weighted variance generative model, where the contribution of each spectrogram time-frame in parameter learning is weighted. We impose a Gamma prior distribution on the weights, which would effectively lead to a Student's t-distribution instead of Gaussian for speech generative modeling. We develop efficient training and speech enhancement algorithms based on the proposed generative model. Our experimental results on spectrogram auto-encoding and speech enhancement demonstrate the effectiveness and robustness of the proposed approach compared to the standard unweighted variance model.
    SACSoN: Scalable Autonomous Control for Social Navigation. (arXiv:2306.01874v3 [cs.RO] UPDATED)
    Machine learning provides a powerful tool for building socially compliant robotic systems that go beyond simple predictive models of human behavior. By observing and understanding human interactions from past experiences, learning can enable effective social navigation behaviors directly from data. In this paper, our goal is to develop methods for training policies for socially unobtrusive navigation, such that robots can navigate among humans in ways that don't disturb human behavior. We introduce a definition for such behavior based on the counterfactual perturbation of the human: if the robot had not intruded into the space, would the human have acted in the same way? By minimizing this counterfactual perturbation, we can induce robots to behave in ways that do not alter the natural behavior of humans in the shared space. Instantiating this principle requires training policies to minimize their effect on human behavior, and this in turn requires data that allows us to model the behavior of humans in the presence of robots. Therefore, our approach is based on two key contributions. First, we collect a large dataset where an indoor mobile robot interacts with human bystanders. Second, we utilize this dataset to train policies that minimize counterfactual perturbation. We provide supplementary videos and make publicly available the largest-of-its-kind visual navigation dataset on our project page.
    Finding Regions of Counterfactual Explanations via Robust Optimization. (arXiv:2301.11113v3 [cs.LG] UPDATED)
    Counterfactual explanations play an important role in detecting bias and improving the explainability of data-driven classification models. A counterfactual explanation (CE) is a minimally perturbed data point for which the decision of the model changes. Most of the existing methods can only provide one CE, which may not be achievable for the user. In this work we derive an iterative method to calculate robust CEs, i.e. CEs that remain valid even after the features are slightly perturbed. To this end, our method provides a whole region of CEs allowing the user to choose a suitable recourse to obtain a desired outcome. We use algorithmic ideas from robust optimization and prove convergence results for the most common machine learning methods including logistic regression, decision trees, random forests, and neural networks. Our experiments show that our method can efficiently generate globally optimal robust CEs for a variety of common data sets and classification models.
    Control3Diff: Learning Controllable 3D Diffusion Models from Single-view Images. (arXiv:2304.06700v2 [cs.CV] UPDATED)
    Diffusion models have recently become the de-facto approach for generative modeling in the 2D domain. However, extending diffusion models to 3D is challenging due to the difficulties in acquiring 3D ground truth data for training. On the other hand, 3D GANs that integrate implicit 3D representations into GANs have shown remarkable 3D-aware generation when trained only on single-view image datasets. However, 3D GANs do not provide straightforward ways to precisely control image synthesis. To address these challenges, we present Control3Diff, a 3D diffusion model that combines the strengths of diffusion models and 3D GANs for versatile, controllable 3D-aware image synthesis for single-view datasets. Control3Diff explicitly models the underlying latent distribution (optionally conditioned on external inputs), thus enabling direct control during the diffusion process. Moreover, our approach is general and applicable to any type of controlling input, allowing us to train it with the same diffusion objective without any auxiliary supervision. We validate the efficacy of Control3Diff on standard image generation benchmarks, including FFHQ, AFHQ, and ShapeNet, using various conditioning inputs such as images, sketches, and text prompts. Please see the project website (https://jiataogu.me/control3diff) for video comparisons.
    Robust Covariate Shift Adaptation for Density-Ratio Estimation. (arXiv:2310.16638v2 [stat.ME] UPDATED)
    Consider a scenario where we have access to train data with both covariates and outcomes while test data only contains covariates. In this scenario, our primary aim is to predict the missing outcomes of the test data. With this objective in mind, we train parametric regression models under a covariate shift, where covariate distributions are different between the train and test data. For this problem, existing studies have proposed covariate shift adaptation via importance weighting using the density ratio. This approach averages the train data losses, each weighted by an estimated ratio of the covariate densities between the train and test data, to approximate the test-data risk. Although it allows us to obtain a test-data risk minimizer, its performance heavily relies on the accuracy of the density ratio estimation. Moreover, even if the density ratio can be consistently estimated, the estimation errors of the density ratio also yield bias in the estimators of the regression model's parameters of interest. To mitigate these challenges, we introduce a doubly robust estimator for covariate shift adaptation via importance weighting, which incorporates an additional estimator for the regression function. Leveraging double machine learning techniques, our estimator reduces the bias arising from the density ratio estimation errors. We demonstrate the asymptotic distribution of the regression parameter estimator. Notably, our estimator remains consistent if either the density ratio estimator or the regression function is consistent, showcasing its robustness against potential errors in density ratio estimation. Finally, we confirm the soundness of our proposed method via simulation studies.
    Generalizing to new geometries with Geometry-Aware Autoregressive Models (GAAMs) for fast calorimeter simulation. (arXiv:2305.11531v3 [physics.ins-det] UPDATED)
    Generation of simulated detector response to collision products is crucial to data analysis in particle physics, but computationally very expensive. One subdetector, the calorimeter, dominates the computational time due to the high granularity of its cells and complexity of the interactions. Generative models can provide more rapid sample production, but currently require significant effort to optimize performance for specific detector geometries, often requiring many models to describe the varying cell sizes and arrangements, without the ability to generalize to other geometries. We develop a geometry-aware autoregressive model, which learns how the calorimeter response varies with geometry, and is capable of generating simulated responses to unseen geometries without additional training. The geometry-aware model outperforms a baseline unaware model by over 50% in several metrics such as the Wasserstein distance between the generated and the true distributions of key quantities which summarize the simulated response. A single geometry-aware model could replace the hundreds of generative models currently designed for calorimeter simulation by physicists analyzing data collected at the Large Hadron Collider. This proof-of-concept study motivates the design of a foundational model that will be a crucial tool for the study of future detectors, dramatically reducing the large upfront investment usually needed to develop generative calorimeter models.
    Risk-Averse Model Uncertainty for Distributionally Robust Safe Reinforcement Learning. (arXiv:2301.12593v2 [cs.LG] UPDATED)
    Many real-world domains require safe decision making in uncertain environments. In this work, we introduce a deep reinforcement learning framework for approaching this important problem. We consider a distribution over transition models, and apply a risk-averse perspective towards model uncertainty through the use of coherent distortion risk measures. We provide robustness guarantees for this framework by showing it is equivalent to a specific class of distributionally robust safe reinforcement learning problems. Unlike existing approaches to robustness in deep reinforcement learning, however, our formulation does not involve minimax optimization. This leads to an efficient, model-free implementation of our approach that only requires standard data collection from a single training environment. In experiments on continuous control tasks with safety constraints, we demonstrate that our framework produces robust performance and safety at deployment time across a range of perturbed test environments.
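To make the idea of a risk-averse view over a distribution of transition models concrete, the sketch below computes conditional value-at-risk (CVaR), a well-known coherent distortion risk measure, over sampled per-model costs. Whether the paper uses CVaR specifically is an assumption, and the function name and sample costs are hypothetical.

```python
import math


def cvar(costs, alpha):
    """Average of the worst (1 - alpha) fraction of sampled costs.

    Higher cost is treated as worse. alpha = 0 recovers the plain mean;
    alpha close to 1 focuses on the most pessimistic samples.
    """
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must lie in [0, 1)")
    if not costs:
        raise ValueError("costs must be non-empty")
    ordered = sorted(costs, reverse=True)  # worst outcomes first
    tail_size = max(1, math.ceil((1.0 - alpha) * len(ordered)))
    tail = ordered[:tail_size]
    return sum(tail) / len(tail)


# Hypothetical expected costs of one policy under four sampled transition models.
sampled_costs = [1.0, 2.0, 3.0, 10.0]
risk_neutral = cvar(sampled_costs, alpha=0.0)  # mean over all models
risk_averse = cvar(sampled_costs, alpha=0.5)   # mean over the worst half
```

A risk-averse agent would compare policies by `risk_averse` rather than `risk_neutral`, which penalizes policies that perform badly under a few plausible models.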
    Efficient Sensor Placement from Regression with Sparse Gaussian Processes in Continuous and Discrete Spaces. (arXiv:2303.00028v6 [cs.RO] UPDATED)
    The sensor placement problem is a common problem that arises when monitoring correlated phenomena, such as temperature, precipitation, and salinity. Existing approaches to this problem typically formulate it as the maximization of information metrics, such as mutual information (MI), and use optimization methods such as greedy algorithms in discrete domains, and derivative-free optimization methods such as genetic algorithms in continuous domains. However, computing MI for sensor placement requires discretizing the environment, and its computation cost depends on the size of the discretized environment. This limitation restricts these approaches from scaling to large problems. We have uncovered a novel connection between the sensor placement problem and sparse Gaussian processes (SGP). Our approach leverages SGPs and is gradient-based, which allows us to efficiently find solution placements in continuous environments. We generalize our method to also handle discrete environments. Our experimental results on four real-world datasets demonstrate that our approach generates sensor placements consistently on par with or better than the prior state-of-the-art approaches in terms of both MI and reconstruction quality, all while being significantly faster. Our computationally efficient approach enables both large-scale sensor placement and fast robotic sensor placement for informative path planning algorithms.
    Convolutional Visual Prompt for Robust Visual Perception. (arXiv:2303.00198v2 [cs.CV] UPDATED)
    Vision models are often vulnerable to out-of-distribution (OOD) samples without adapting. While visual prompts offer a lightweight method of input-space adaptation for large-scale vision models, they rely on a high-dimensional additive vector and labeled data. This leads to overfitting when adapting models in a self-supervised test-time setting without labels. We introduce convolutional visual prompts (CVP) for label-free test-time adaptation for robust visual perception. The structured nature of CVP demands fewer trainable parameters, less than 1% compared to standard visual prompts, combating overfitting. Extensive experiments and analysis on a wide variety of OOD visual perception tasks show that our approach is effective, improving robustness by up to 5.87% over several large-scale models.
    Variance Reduced Halpern Iteration for Finite-Sum Monotone Inclusions. (arXiv:2310.02987v2 [cs.LG] UPDATED)
    Machine learning approaches relying on such criteria as adversarial robustness or multi-agent settings have raised the need for solving game-theoretic equilibrium problems. Of particular relevance to these applications are methods targeting finite-sum structure, which generically arises in empirical variants of learning problems in these contexts. Further, methods with computable approximation errors are highly desirable, as they provide verifiable exit criteria. Motivated by these applications, we study finite-sum monotone inclusion problems, which model broad classes of equilibrium problems. Our main contributions are variants of the classical Halpern iteration that employ variance reduction to obtain improved complexity guarantees in which $n$ component operators in the finite sum are "on average" either cocoercive or Lipschitz continuous and monotone, with parameter $L$. The resulting oracle complexity of our methods, which provide guarantees for the last iterate and for a (computable) operator norm residual, is $\widetilde{\mathcal{O}}( n + \sqrt{n}L\varepsilon^{-1})$, which improves upon existing methods by a factor up to $\sqrt{n}$. This constitutes the first variance reduction-type result for general finite-sum monotone inclusions and for more specific problems such as convex-concave optimization when operator norm residual is the optimality measure. We further argue that, up to poly-logarithmic factors, this complexity is unimprovable in the monotone Lipschitz setting; i.e., the provided result is near-optimal.
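For readers unfamiliar with the classical Halpern iteration mentioned above, here is a minimal sketch of the plain (not variance-reduced) scheme, x_{k+1} = lam_k * x_0 + (1 - lam_k) * T(x_k) with lam_k = 1/(k+2), applied to a nonexpansive map T. The example map (a 90-degree rotation, whose only fixed point is the origin) and all function names are illustrative assumptions, not taken from the paper.

```python
import math


def halpern(T, x0, num_iters):
    """Classical Halpern iteration anchored at the starting point x0."""
    x = list(x0)
    for k in range(num_iters):
        lam = 1.0 / (k + 2)  # anchoring weight, decays to zero
        tx = T(x)
        # Convex combination of the anchor x0 and the operator output T(x).
        x = [lam * a + (1.0 - lam) * b for a, b in zip(x0, tx)]
    return x


def rotate90(x):
    """Nonexpansive map on the plane: rotation by 90 degrees."""
    return [-x[1], x[0]]


def fixed_point_residual(T, x):
    """Computable residual ||x - T(x)||, usable as an exit criterion."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, T(x))))


x_final = halpern(rotate90, [1.0, 0.0], 1000)
res = fixed_point_residual(rotate90, x_final)
```

Plain fixed-point iteration `x = T(x)` would cycle forever on this rotation, whereas the anchored iterates drive the residual toward zero. The residual is exactly the kind of computable quantity that can serve as a stopping test.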
    NLI4CT: Multi-Evidence Natural Language Inference for Clinical Trial Reports. (arXiv:2305.03598v2 [cs.CL] UPDATED)
    How can we interpret and retrieve medical evidence to support clinical decisions? Clinical trial reports (CTR) amassed over the years contain indispensable information for the development of personalized medicine. However, it is practically infeasible to manually inspect more than 400,000 clinical trial reports in order to find the best evidence for experimental treatments. Natural Language Inference (NLI) offers a potential solution to this problem, by allowing the scalable computation of textual entailment. However, existing NLI models perform poorly on biomedical corpora, and previously published datasets fail to capture the full complexity of inference over CTRs. In this work, we present a novel resource to advance research on NLI for reasoning on CTRs. The resource includes two main tasks: first, determining the inference relation between a natural language statement and a CTR; second, retrieving supporting facts to justify the predicted relation. We provide NLI4CT, a corpus of 2400 statements and CTRs, annotated for these tasks. Baselines on this corpus expose the limitations of existing NLI models, with 6 state-of-the-art NLI models achieving a maximum F1 score of 0.627. To the best of our knowledge, we are the first to design a task that covers the interpretation of full CTRs. To encourage further work on this challenging dataset, we make the corpus, competition leaderboard, website and code to replicate the baseline experiments available at: https://github.com/ai-systems/nli4ct
    Unifying GANs and Score-Based Diffusion as Generative Particle Models. (arXiv:2305.16150v2 [cs.LG] UPDATED)
    Particle-based deep generative models, such as gradient flows and score-based diffusion models, have recently gained traction thanks to their striking performance. Their principle of displacing particle distributions using differential equations is conventionally seen as opposed to the previously widespread generative adversarial networks (GANs), which involve training a pushforward generator network. In this paper we challenge this interpretation, and propose a novel framework that unifies particle and adversarial generative models by framing generator training as a generalization of particle models. This suggests that a generator is an optional addition to any such generative model. Consequently, integrating a generator into a score-based diffusion model and training a GAN without a generator naturally emerge from our framework. We empirically test the viability of these original models as proofs of concept of potential applications of our framework.
    SpatialRank: Urban Event Ranking with NDCG Optimization on Spatiotemporal Data. (arXiv:2310.00270v5 [cs.LG] UPDATED)
    The problem of urban event ranking aims at predicting the top-k most risky locations of future events such as traffic accidents and crimes. This problem is of fundamental importance to public safety and urban administration especially when limited resources are available. The problem is, however, challenging due to complex and dynamic spatio-temporal correlations between locations, uneven distribution of urban events in space, and the difficulty of correctly ranking nearby locations with similar features. Prior works on event forecasting mostly aim at accurately predicting the actual risk score or counts of events for all the locations. Rankings obtained as such usually have low quality due to prediction errors. Learning-to-rank methods directly optimize measures such as Normalized Discounted Cumulative Gain (NDCG), but cannot handle the spatiotemporal autocorrelation existing among locations. In this paper, we bridge the gap by proposing a novel spatial event ranking approach named SpatialRank. SpatialRank features adaptive graph convolution layers that dynamically learn the spatiotemporal dependencies across locations from data. In addition, the model optimizes through surrogates a hybrid NDCG loss with a spatial component to better rank neighboring spatial locations. We design an importance-sampling with a spatial filtering algorithm to effectively evaluate the loss during training. Comprehensive experiments on three real-world datasets demonstrate that SpatialRank can effectively identify the top riskiest locations of crimes and traffic accidents and outperform state-of-the-art methods in terms of NDCG by up to 12.7%.
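Since the abstract evaluates rankings by NDCG, the following sketch shows how standard NDCG@k can be computed for a list of locations. It uses a linear gain (some variants use 2^rel - 1 instead) and does not include the paper's spatial component or surrogate loss; the function names and toy risk values are hypothetical.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain of the first k items in ranked order."""
    return sum(rel / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(true_risk, predicted_score, k):
    """NDCG@k: DCG of the predicted ordering divided by the ideal DCG."""
    # Rank locations by predicted score, highest first.
    order = sorted(range(len(true_risk)),
                   key=lambda i: predicted_score[i], reverse=True)
    ranked = [true_risk[i] for i in order]
    ideal = sorted(true_risk, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(ranked, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# Toy example: four locations with observed event counts as relevance.
true_risk = [3, 0, 2, 1]
good_scores = [0.9, 0.1, 0.5, 0.3]  # orders locations exactly by true risk
bad_scores = [0.1, 0.9, 0.3, 0.5]   # reverses the true order

ndcg_good = ndcg_at_k(true_risk, good_scores, k=3)
ndcg_bad = ndcg_at_k(true_risk, bad_scores, k=3)
```

Because the logarithmic discount weights the top of the list most heavily, a model can have small regression error on risk scores and still obtain a poor NDCG if it swaps the highest-risk locations.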
    Hierarchical clustering with OWA-based linkages, the Lance-Williams formula, and dendrogram inversions. (arXiv:2303.05683v2 [stat.ML] UPDATED)
    Agglomerative hierarchical clustering based on Ordered Weighted Averaging (OWA) operators not only generalises the single, complete, and average linkages, but also includes intercluster distances based on a few nearest or farthest neighbours, trimmed and winsorised means of pairwise point similarities, amongst many others. We explore the relationships between the famous Lance-Williams update formula and the extended OWA-based linkages with weights generated via infinite coefficient sequences. Furthermore, we provide some conditions for the weight generators to guarantee the resulting dendrograms to be free from unaesthetic inversions.
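As an illustration of how an OWA operator generalises the classical linkages, the sketch below applies a weight vector to the sorted pairwise distances between two clusters. Distances are used here as the dissimilarity, and the function names and toy clusters are hypothetical; the Lance-Williams update and the weight generators studied in the paper are not covered.

```python
import math


def owa(values, weights):
    """Ordered weighted average: weights apply to values sorted descending."""
    ordered = sorted(values, reverse=True)
    return sum(w * v for w, v in zip(weights, ordered))


def owa_linkage(cluster_a, cluster_b, weights):
    """Intercluster distance as an OWA of all pairwise point distances."""
    dists = [math.dist(p, q) for p in cluster_a for q in cluster_b]
    if len(weights) != len(dists):
        raise ValueError("need one weight per pairwise distance")
    return owa(dists, weights)


a = [(0.0, 0.0), (1.0, 0.0)]
b = [(3.0, 0.0), (5.0, 0.0)]
n = len(a) * len(b)  # four pairwise distances: 3, 5, 2, 4

single = owa_linkage(a, b, [0.0] * (n - 1) + [1.0])    # all weight on the minimum
complete = owa_linkage(a, b, [1.0] + [0.0] * (n - 1))  # all weight on the maximum
average = owa_linkage(a, b, [1.0 / n] * n)             # uniform weights
```

Other weight vectors give the intermediate linkages the abstract mentions, for example spreading the weight over the few smallest distances for a nearest-neighbours-based linkage.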
    Leave-one-out Distinguishability in Machine Learning. (arXiv:2309.17310v3 [cs.LG] UPDATED)
    We introduce a new analytical framework to quantify the changes in a machine learning algorithm's output distribution following the inclusion of a few data points in its training set, a notion we define as leave-one-out distinguishability (LOOD). This problem is key to measuring data **memorization** and **information leakage** in machine learning, and the **influence** of training data points on model predictions. We illustrate how our method broadens and refines existing empirical measures of memorization and privacy risks associated with training data. We use Gaussian processes to model the randomness of machine learning algorithms, and validate LOOD with extensive empirical analysis of information leakage using membership inference attacks. Our theoretical framework enables us to investigate the causes of information leakage and where the leakage is high. For example, we analyze the influence of activation functions on data memorization. Additionally, our method allows us to optimize queries that disclose the most significant information about the training data in the leave-one-out setting. We illustrate how optimal queries can be used for accurate **reconstruction** of training data.
    Adaptive whitening with fast gain modulation and slow synaptic plasticity. (arXiv:2308.13633v2 [q-bio.NC] UPDATED)
    Neurons in early sensory areas rapidly adapt to changing sensory statistics, both by normalizing the variance of their individual responses and by reducing correlations between their responses. Together, these transformations may be viewed as an adaptive form of statistical whitening. Existing mechanistic models of adaptive whitening exclusively use either synaptic plasticity or gain modulation as the biological substrate for adaptation; however, on their own, each of these models has significant limitations. In this work, we unify these approaches in a normative multi-timescale mechanistic model that adaptively whitens its responses with complementary computational roles for synaptic plasticity and gain modulation. Gains are modified on a fast timescale to adapt to the current statistical context, whereas synapses are modified on a slow timescale to match structural properties of the input statistics that are invariant across contexts. Our model is derived from a novel multi-timescale whitening objective that factorizes the inverse whitening matrix into basis vectors, which correspond to synaptic weights, and a diagonal matrix, which corresponds to neuronal gains. We test our model on synthetic and natural datasets and find that the synapses learn optimal configurations over long timescales that enable adaptive whitening on short timescales using gain modulation.
    Efficient Adversarial Contrastive Learning via Robustness-Aware Coreset Selection. (arXiv:2302.03857v5 [cs.LG] UPDATED)
    Adversarial contrastive learning (ACL) does not require expensive data annotations but outputs a robust representation that withstands adversarial attacks and also generalizes to a wide range of downstream tasks. However, ACL needs tremendous running time to generate the adversarial variants of all training data, which limits its scalability to large datasets. To speed up ACL, this paper proposes a robustness-aware coreset selection (RCS) method. RCS does not require label information and searches for an informative subset that minimizes a representational divergence, which is the distance of the representation between natural data and their virtual adversarial variants. The vanilla solution of RCS via traversing all possible subsets is computationally prohibitive. Therefore, we theoretically transform RCS into a surrogate problem of submodular maximization, of which the greedy search is an efficient solution with an optimality guarantee for the original problem. Empirically, our comprehensive results corroborate that RCS can speed up ACL by a large margin without significantly hurting the robustness transferability. Notably, to the best of our knowledge, we are the first to conduct ACL efficiently on the large-scale ImageNet-1K dataset to obtain an effective robust representation via RCS. Our source code is at https://github.com/GodXuxilie/Efficient_ACL_via_RCS.
    Diversified Outlier Exposure for Out-of-Distribution Detection via Informative Extrapolation. (arXiv:2310.13923v2 [cs.LG] UPDATED)
    Out-of-distribution (OOD) detection is important for deploying reliable machine learning models in real-world applications. Recent advances in outlier exposure have shown promising results on OOD detection via fine-tuning the model with informatively sampled auxiliary outliers. However, previous methods assume that the collected outliers can be sufficiently large and representative to cover the boundary between ID and OOD data, which might be impractical and challenging. In this work, we propose a novel framework, namely, Diversified Outlier Exposure (DivOE), for effective OOD detection via informative extrapolation based on the given auxiliary outliers. Specifically, DivOE introduces a new learning objective, which diversifies the auxiliary distribution by explicitly synthesizing more informative outliers for extrapolation during training. It leverages a multi-step optimization method to generate novel outliers beyond the original ones, which is compatible with many variants of outlier exposure. Extensive experiments and analyses have been conducted to characterize and demonstrate the effectiveness of the proposed DivOE. The code is publicly available at: https://github.com/tmlr-group/DivOE.
    Monte Carlo guided Diffusion for Bayesian linear inverse problems. (arXiv:2308.07983v2 [stat.ML] UPDATED)
    Ill-posed linear inverse problems arise frequently in various applications, from computational photography to medical imaging. A recent line of research exploits Bayesian inference with informative priors to handle the ill-posedness of such problems. Amongst such priors, score-based generative models (SGM) have recently been successfully applied to several different inverse problems. In this study, we exploit the particular structure of the prior defined by the SGM to define a sequence of intermediate linear inverse problems. As the noise level decreases, the posteriors of these inverse problems get closer to the target posterior of the original inverse problem. To sample from this sequence of posteriors, we propose the use of Sequential Monte Carlo (SMC) methods. The proposed algorithm, MCGDiff, is shown to be theoretically grounded and we provide numerical simulations showing that it outperforms competing baselines when dealing with ill-posed inverse problems in a Bayesian setting.
    EqDrive: Efficient Equivariant Motion Forecasting with Multi-Modality for Autonomous Driving. (arXiv:2310.17540v1 [cs.RO])
    Forecasting vehicular motions in autonomous driving requires a deep understanding of agent interactions and the preservation of motion equivariance under Euclidean geometric transformations. Traditional models often lack the sophistication needed to handle the intricate dynamics inherent to autonomous vehicles and the interaction relationships among agents in the scene. As a result, these models have a lower model capacity, which then leads to higher prediction errors and lower training efficiency. In our research, we employ EqMotion, a leading equivariant particle and human prediction model that also accounts for invariant agent interactions, for the task of multi-agent vehicle motion forecasting. In addition, we use a multi-modal prediction mechanism to account for multiple possible future paths in a probabilistic manner. By leveraging EqMotion, our model achieves state-of-the-art (SOTA) performance with fewer parameters (1.2 million) and a significantly reduced training time (less than 2 hours).
    On Embeddings for Numerical Features in Tabular Deep Learning. (arXiv:2203.05556v4 [cs.LG] UPDATED)
    Recently, Transformer-like deep architectures have shown strong performance on tabular data problems. Unlike traditional models, e.g., MLP, these architectures map scalar values of numerical features to high-dimensional embeddings before mixing them in the main backbone. In this work, we argue that embeddings for numerical features are an underexplored degree of freedom in tabular DL, which allows constructing more powerful DL models and competing with GBDT on some traditionally GBDT-friendly benchmarks. We start by describing two conceptually different approaches to building embedding modules: the first one is based on a piecewise linear encoding of scalar values, and the second one utilizes periodic activations. Then, we empirically demonstrate that these two approaches can lead to significant performance boosts compared to the embeddings based on conventional blocks such as linear layers and ReLU activations. Importantly, we also show that embedding numerical features is beneficial for many backbones, not only for Transformers. Specifically, after proper embeddings, simple MLP-like models can perform on par with the attention-based architectures. Overall, we highlight embeddings for numerical features as an important design aspect with good potential for further improvements in tabular DL.
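The two embedding families named in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the bin edges and frequencies are hypothetical fixed values (in practice they would be chosen from data or learned), the piecewise linear variant shown here simply clips every bin to [0, 1], and the function names are made up for illustration.

```python
import math


def piecewise_linear_encode(x, edges):
    """Encode scalar x against sorted bin edges (assumed, not from the paper).

    Returns one value per bin: 1.0 for bins fully below x, 0.0 for bins
    fully above x, and the fractional position inside the bin containing x.
    """
    encoded = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        frac = (x - lo) / (hi - lo)
        encoded.append(min(1.0, max(0.0, frac)))  # clip to [0, 1]
    return encoded


def periodic_encode(x, freqs):
    """Encode scalar x as [sin(2*pi*c*x), cos(2*pi*c*x)] for each frequency c.

    The frequencies are fixed here for illustration only.
    """
    encoded = []
    for c in freqs:
        encoded.append(math.sin(2.0 * math.pi * c * x))
        encoded.append(math.cos(2.0 * math.pi * c * x))
    return encoded


if __name__ == "__main__":
    print(piecewise_linear_encode(2.5, [0.0, 1.0, 2.0, 3.0, 4.0]))
    print(periodic_encode(0.25, [1.0, 2.0]))
```

Either encoding turns one scalar into a vector, which a downstream linear layer or backbone can then mix with the other features.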
    Optimal Scoring Rule Design under Partial Knowledge. (arXiv:2107.07420v2 [cs.GT] UPDATED)
    This paper studies the design of optimal proper scoring rules when the principal has partial knowledge of an agent's signal distribution. Recent work characterizes the proper scoring rules that maximize the increase of an agent's payoff when the agent chooses to access a costly signal to refine a posterior belief from her prior prediction, under the assumption that the agent's signal distribution is fully known to the principal. In our setting, the principal only knows about a set of distributions where the agent's signal distribution belongs. We formulate the scoring rule design problem as a max-min optimization that maximizes the worst-case increase in payoff across the set of distributions. We propose an efficient algorithm to compute an optimal scoring rule when the set of distributions is finite, and devise a fully polynomial-time approximation scheme that accommodates various infinite sets of distributions. We further remark that widely used scoring rules, such as the quadratic and log rules, as well as previously identified optimal scoring rules under full knowledge, can be far from optimal in our partial knowledge settings.
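For readers unfamiliar with the quadratic and log rules mentioned above, the following sketch shows what "proper" means: under either rule, an agent's expected score is maximized by reporting her true belief. It does not implement the paper's max-min design algorithm; the function names and the example beliefs are illustrative assumptions.

```python
import math


def quadratic_score(report, outcome):
    """Quadratic (Brier-style) score: 2 * p_i - sum_j p_j^2."""
    return 2.0 * report[outcome] - sum(p * p for p in report)


def log_score(report, outcome):
    """Logarithmic score: log of the probability assigned to the outcome."""
    return math.log(report[outcome])


def expected_score(rule, belief, report):
    """Expected score of `report` when outcomes follow `belief`."""
    return sum(q * rule(report, i) for i, q in enumerate(belief))


if __name__ == "__main__":
    belief = [0.7, 0.3]       # hypothetical true belief
    misreport = [0.5, 0.5]    # hypothetical untruthful report
    for rule in (quadratic_score, log_score):
        truthful = expected_score(rule, belief, belief)
        hedged = expected_score(rule, belief, misreport)
        print(rule.__name__, truthful, hedged)
```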
    Towards Unifying Diffusion Models for Probabilistic Spatio-Temporal Graph Learning. (arXiv:2310.17360v1 [cs.LG])
    Spatio-temporal graph learning is a fundamental problem in the Web of Things era, which enables a plethora of Web applications such as smart cities, human mobility and climate analysis. Existing approaches tackle different learning tasks independently, tailoring their models to unique task characteristics. These methods, however, fall short of modeling intrinsic uncertainties in the spatio-temporal data. Meanwhile, their specialized designs limit their universality as general spatio-temporal learning solutions. In this paper, we propose to model the learning tasks in a unified perspective, viewing them as predictions based on conditional information with shared spatio-temporal patterns. Based on this proposal, we introduce Unified Spatio-Temporal Diffusion Models (USTD) to address the tasks uniformly within the uncertainty-aware diffusion framework. USTD is holistically designed, comprising a shared spatio-temporal encoder and attention-based denoising networks that are task-specific. The shared encoder, optimized by a pre-training strategy, effectively captures conditional spatio-temporal patterns. The denoising networks, utilizing both cross- and self-attention, integrate conditional dependencies and generate predictions. Opting for forecasting and kriging as downstream tasks, we design Spatial Gated Attention (SGA) and Temporal Gated Attention (TGA) for each task, with different emphases on the spatial and temporal dimensions, respectively. By combining the advantages of deterministic encoders and probabilistic diffusion models, USTD achieves state-of-the-art performance compared to deterministic and probabilistic baselines in both tasks, while also providing valuable uncertainty estimates.
    Human-Guided Complexity-Controlled Abstractions. (arXiv:2310.17550v1 [cs.LG])
    Neural networks often learn task-specific latent representations that fail to generalize to novel settings or tasks. Conversely, humans learn discrete representations (i.e., concepts or words) at a variety of abstraction levels (e.g., ``bird'' vs. ``sparrow'') and deploy the appropriate abstraction based on task. Inspired by this, we train neural models to generate a spectrum of discrete representations, and control the complexity of the representations (roughly, how many bits are allocated for encoding inputs) by tuning the entropy of the distribution over representations. In finetuning experiments, using only a small number of labeled examples for a new task, we show that (1) tuning the representation to a task-appropriate complexity level supports the highest finetuning performance, and (2) in a human-participant study, users were able to identify the appropriate complexity level for a downstream task using visualizations of discrete representations. Our results indicate a promising direction for rapid model finetuning by leveraging human insight.
    Variance of ML-based software fault predictors: are we really improving fault prediction?. (arXiv:2310.17264v1 [cs.SE])
    Software quality assurance activities become increasingly difficult as software systems become more and more complex and continuously grow in size. Moreover, testing becomes even more expensive when dealing with large-scale systems. Thus, to effectively allocate quality assurance resources, researchers have proposed fault prediction (FP), which utilizes machine learning (ML) to predict fault-prone code areas. However, ML algorithms typically make use of stochastic elements to increase the prediction models' generalizability and the efficiency of the training process. These stochastic elements, also known as nondeterminism-introducing (NI) factors, lead to variance in the training process and, as a result, to variance in prediction accuracy and training time. This variance poses a challenge for reproducibility in research. More importantly, while fault prediction models may have shown good performance in the lab (e.g., oftentimes involving multiple runs and averaging outcomes), high variance of results can pose the risk that these models show low performance when applied in practice. In this work, we experimentally analyze the variance of a state-of-the-art fault prediction approach. Our experimental results indicate that NI factors can indeed cause considerable variance in the fault prediction models' accuracy. We observed a maximum variance of 10.10% in terms of the per-class accuracy metric. We thus also discuss how to deal with such variance.
    Optimization dependent generalization bound for ReLU networks based on sensitivity in the tangent bundle. (arXiv:2310.17378v1 [cs.LG])
    Recent advances in deep learning have given us some very promising results on the generalization ability of deep neural networks; however, the literature still lacks a comprehensive theory explaining why heavily over-parametrized models are able to generalize well while fitting the training data. In this paper we propose a PAC type bound on the generalization error of feedforward ReLU networks via estimating the Rademacher complexity of the set of networks available from an initial parameter vector via gradient descent. The key idea is to bound the sensitivity of the network's gradient to perturbation of the input data along the optimization trajectory. The obtained bound does not explicitly depend on the depth of the network. Our results are experimentally verified on the MNIST and CIFAR-10 datasets.
    Sign Language Recognition without frame-sequencing constraints: A proof of concept on the Argentinian Sign Language. (arXiv:2310.17437v1 [cs.CV])
    Automatic sign language recognition (SLR) is an important topic within the areas of human-computer interaction and machine learning. On the one hand, it poses a complex challenge that requires the intervention of various knowledge areas, such as video processing, image processing, intelligent systems and linguistics. On the other hand, robust recognition of sign language could assist in the translation process and the integration of hearing-impaired people, as well as the teaching of sign language for the hearing population. SLR systems usually employ Hidden Markov Models, Dynamic Time Warping or similar models to recognize signs. Such techniques exploit the sequential ordering of frames to reduce the number of hypotheses. This paper presents a general probabilistic model for sign classification that combines sub-classifiers based on different types of features such as position, movement and handshape. The model employs a bag-of-words approach in all classification steps, to explore the hypothesis that ordering is not essential for recognition. The proposed model achieved an accuracy rate of 97% on an Argentinian Sign Language dataset containing 64 classes of signs and 3200 samples, providing some evidence that recognition without ordering is indeed possible.
    Orchestration of Emulator Assisted Mobile Edge Tuning for AI Foundation Models: A Multi-Agent Deep Reinforcement Learning Approach. (arXiv:2310.17492v1 [cs.AI])
    The efficient deployment and fine-tuning of foundation models are pivotal in contemporary artificial intelligence. In this study, we present a groundbreaking paradigm integrating Mobile Edge Computing (MEC) with foundation models, specifically designed to enhance local task performance on user equipment (UE). Central to our approach is the innovative Emulator-Adapter architecture, segmenting the foundation model into two cohesive modules. This design not only conserves computational resources but also ensures adaptability and fine-tuning efficiency for downstream tasks. Additionally, we introduce an advanced resource allocation mechanism that is fine-tuned to the needs of the Emulator-Adapter structure in decentralized settings. To address the challenges presented by this system, we employ a hybrid multi-agent Deep Reinforcement Learning (DRL) strategy, adept at handling mixed discrete-continuous action spaces, ensuring dynamic and optimal resource allocations. Our comprehensive simulations and validations underscore the practical viability of our approach, demonstrating its robustness, efficiency, and scalability. Collectively, this work offers a fresh perspective on deploying foundation models and balancing computational efficiency with task proficiency.
    Secure short-term load forecasting for smart grids with transformer-based federated learning. (arXiv:2310.17477v1 [cs.LG])
    Electricity load forecasting is an essential task within smart grids to assist demand and supply balance. While advanced deep learning models require large amounts of high-resolution data for accurate short-term load predictions, fine-grained load profiles can expose users' electricity consumption behaviors, which raises privacy and security concerns. One solution to improve data privacy is federated learning, where models are trained locally on private data, and only the trained model parameters are merged and updated on a global server. Therefore, this paper presents a novel transformer-based deep learning approach with federated learning for short-term electricity load prediction. To evaluate our results, we benchmark our federated learning architecture against central and local learning and compare the performance of our model to long short-term memory models and convolutional neural networks. Our simulations are based on a dataset from a German university campus and show that transformer-based forecasting is a promising alternative to state-of-the-art models within federated learning.
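The merge step described above, where only locally trained parameters reach the global server, can be sketched as follows. This is an illustrative assumption rather than the paper's procedure: it uses sample-size-weighted averaging in the style of FedAvg, and the parameter layout (a dict of flat lists) and the function name `federated_average` are hypothetical.

```python
def federated_average(client_params, client_sizes):
    """Merge client parameters on the server by weighted averaging.

    client_params: list of dicts mapping a parameter name to a flat list of floats.
    client_sizes:  number of local training samples per client (the weights).
    No raw load profiles are exchanged, only the parameter values.
    """
    total = sum(client_sizes)
    merged = {}
    for name in client_params[0]:
        dim = len(client_params[0][name])
        merged[name] = [
            sum(params[name][i] * size
                for params, size in zip(client_params, client_sizes)) / total
            for i in range(dim)
        ]
    return merged


if __name__ == "__main__":
    # Two hypothetical clients with different amounts of local data.
    clients = [{"w": [1.0, 2.0]}, {"w": [3.0, 4.0]}]
    print(federated_average(clients, [1, 3]))
```

In a full training loop the server would send the merged parameters back to the clients and repeat for several rounds.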
    De-novo Chemical Reaction Generation by Means of Temporarily Convolutional Neural Networks. (arXiv:2310.17341v1 [cs.LG])
    We present here a combination of two networks, Recurrent Neural Networks (RNN) and Temporarily Convolutional Neural Networks (TCN), in de novo reaction generation using the novel Reaction Smiles-like representation of reactions (CGRSmiles) with atom mapping directly incorporated. Recurrent Neural Networks are known for their autoregressive properties and are frequently used in language modelling with direct application to SMILES generation. The relatively novel TCNs possess similar properties with a wide receptive field while obeying the causality required for natural language processing (NLP). The combination of both latent representations expressed through TCN and RNN results in better overall performance compared to RNN alone. Additionally, it is shown that different fine-tuning protocols have a profound impact on the generative scope of the model when applied to a dataset of interest via transfer learning.
    Enhancing Graph Neural Networks with Structure-Based Prompt. (arXiv:2310.17394v1 [cs.LG])
    Graph Neural Networks (GNNs) are powerful in learning semantics of graph data. Recently, a new paradigm "pre-train, prompt" has shown promising results in adapting GNNs to various tasks with less supervised data. The success of such a paradigm can be attributed to the more consistent objectives of pre-training and task-oriented prompt tuning, where the pre-trained knowledge can be effectively transferred to downstream tasks. However, an overlooked issue of existing studies is that the structure information of the graph is usually exploited during pre-training for learning node representations, while neglected in the prompt tuning stage for learning task-specific parameters. To bridge this gap, we propose a novel structure-based prompting method for GNNs, namely SAP, which consistently exploits structure information in both pre-training and prompt tuning stages. In particular, SAP 1) employs dual-view contrastive learning to align the latent semantic spaces of node attributes and graph structure, and 2) incorporates structure information in the prompted graph to elicit more pre-trained knowledge in prompt tuning. We conduct extensive experiments on node classification and graph classification tasks to show the effectiveness of SAP. Moreover, we show that SAP can lead to better performance in more challenging few-shot scenarios on both homophilous and heterophilous graphs.
    Bayesian Neural Controlled Differential Equations for Treatment Effect Estimation. (arXiv:2310.17463v1 [cs.LG])
    Treatment effect estimation in continuous time is crucial for personalized medicine. However, existing methods for this task are limited to point estimates of the potential outcomes, whereas uncertainty estimates have been ignored. Needless to say, uncertainty quantification is crucial for reliable decision-making in medical applications. To fill this gap, we propose a novel Bayesian neural controlled differential equation (BNCDE) for treatment effect estimation in continuous time. In our BNCDE, the time dimension is modeled through a coupled system of neural controlled differential equations and neural stochastic differential equations, where the neural stochastic differential equations allow for tractable variational Bayesian inference. Thereby, for an assigned sequence of treatments, our BNCDE provides meaningful posterior predictive distributions of the potential outcomes. To the best of our knowledge, ours is the first tailored neural method to provide uncertainty estimates of treatment effects in continuous time. As such, our method is of direct practical value for promoting reliable decision-making in medicine.
    Detection Defenses: An Empty Promise against Adversarial Patch Attacks on Optical Flow. (arXiv:2310.17403v1 [cs.CV])
    Adversarial patches undermine the reliability of optical flow predictions when placed in arbitrary scene locations. Therefore, they pose a realistic threat to real-world motion detection and its downstream applications. Potential remedies are defense strategies that detect and remove adversarial patches, but their influence on the underlying motion prediction has not been investigated. In this paper, we thoroughly examine the currently available detect-and-remove defenses ILP and LGS for a wide selection of state-of-the-art optical flow methods, and illuminate their side effects on the quality and robustness of the final flow predictions. In particular, we implement defense-aware attacks to investigate whether current defenses are able to withstand attacks that take the defense mechanism into account. Our experiments yield two surprising results: Detect-and-remove defenses do not only lower the optical flow quality on benign scenes, in doing so, they also harm the robustness under patch attacks for all tested optical flow methods except FlowNetC. As currently employed detect-and-remove defenses fail to deliver the promised adversarial robustness for optical flow, they evoke a false sense of security. The code is available at https://github.com/cv-stuttgart/DetectionDefenses.
    Likelihood-based Out-of-Distribution Detection with Denoising Diffusion Probabilistic Models. (arXiv:2310.17432v1 [cs.LG])
    Out-of-Distribution detection between dataset pairs has been extensively explored with generative models. We show that likelihood-based Out-of-Distribution detection can be extended to diffusion models by leveraging the fact that they, like other likelihood-based generative models, are dramatically affected by the input sample complexity. Currently, all Out-of-Distribution detection methods with Diffusion Models are reconstruction-based. We propose a new likelihood ratio for Out-of-Distribution detection with Deep Denoising Diffusion Models, which we call the Complexity Corrected Likelihood Ratio. Our likelihood ratio is constructed using Evidence Lower-Bound evaluations from an individual model at various noising levels. We present results that are comparable to state-of-the-art Out-of-Distribution detection methods with generative models.
    Cross-modal Active Complementary Learning with Self-refining Correspondence. (arXiv:2310.17468v1 [cs.CV])
    Recently, image-text matching has attracted more and more attention from academia and industry, which is fundamental to understanding the latent correspondence across visual and textual modalities. However, most existing methods implicitly assume the training pairs are well-aligned while ignoring the ubiquitous annotation noise, a.k.a. noisy correspondence (NC), thereby inevitably leading to a performance drop. Although some methods attempt to address such noise, they still face two challenging problems: excessive memorizing/overfitting and unreliable correction for NC, especially under high noise. To address the two problems, we propose a generalized Cross-modal Robust Complementary Learning framework (CRCL), which benefits from a novel Active Complementary Loss (ACL) and an efficient Self-refining Correspondence Correction (SCC) to improve the robustness of existing methods. Specifically, ACL exploits active and complementary learning losses to reduce the risk of providing erroneous supervision, leading to theoretically and experimentally demonstrated robustness against NC. SCC utilizes multiple self-refining processes with momentum correction to enlarge the receptive field for correcting correspondences, thereby alleviating error accumulation and achieving accurate and stable corrections. We carry out extensive experiments on three image-text benchmarks, i.e., Flickr30K, MS-COCO, and CC152K, to verify the superior robustness of our CRCL against synthetic and real-world noisy correspondences.
    Causal Modeling with Stationary Diffusions. (arXiv:2310.17405v1 [cs.LG])
    We develop a novel approach towards causal inference. Rather than structural equations over a causal graph, we learn stochastic differential equations (SDEs) whose stationary densities model a system's behavior under interventions. These stationary diffusion models do not require the formalism of causal graphs, let alone the common assumption of acyclicity. We show that in several cases, they generalize to unseen interventions on their variables, often better than classical approaches. Our inference method is based on a new theoretical result that expresses a stationarity condition on the diffusion's generator in a reproducing kernel Hilbert space. The resulting kernel deviation from stationarity (KDS) is an objective function of independent interest.
    Towards Learning Monocular 3D Object Localization From 2D Labels using the Physical Laws of Motion. (arXiv:2310.17462v1 [cs.CV])
    We present a novel method for precise 3D object localization in single images from a single calibrated camera using only 2D labels. No expensive 3D labels are needed. Thus, instead of using 3D labels, our model is trained with easy-to-annotate 2D labels along with the physical knowledge of the object's motion. Given this information, the model can infer the latent third dimension, even though it has never seen this information during training. Our method is evaluated on both synthetic and real-world datasets, and we are able to achieve a mean distance error of just 6 cm in our experiments on real data. The results indicate the method's potential as a step towards learning 3D object location estimation, where collecting 3D data for training is not feasible.
    On the recognition of the game type based on physiological signals and eye tracking. (arXiv:2310.17383v1 [cs.LG])
    Automated interpretation of signals yields many impressive applications in the areas of affective computing and human activity recognition (HAR). In this paper we ask whether cognitive activity can be recognized on the basis of a particular set of signals. We use recognition of the game played by the participant as a playground for exploring the problem. We build a classifier of three different games (Space Invaders, Tetris, Tower Defence) and the inter-game pause. We validate the classifier in the player-independent and player-dependent scenarios. We discuss the improvement in the player-dependent scenario in the context of biometric person recognition. On the basis of the results obtained in game classification, we consider potential applications in smart surveillance and the quantified self.
    Tackling Interference Induced by Data Training Loops in A/B Tests: A Weighted Training Approach. (arXiv:2310.17496v1 [stat.ME])
    In modern recommendation systems, the standard pipeline involves training machine learning models on historical data to predict user behaviors and improve recommendations continuously. However, these data training loops can introduce interference in A/B tests, where data generated by control and treatment algorithms, potentially with different distributions, are combined. To address these challenges, we introduce a novel approach called weighted training. This approach entails training a model to predict the probability of each data point appearing in either the treatment or control data and subsequently applying weighted losses during model training. We demonstrate that this approach achieves the least variance among all estimators without causing shifts in the training distributions. Through simulation studies, we demonstrate the lower bias and variance of our approach compared to other methods.
    The IMS Toucan System for the Blizzard Challenge 2023. (arXiv:2310.17499v1 [cs.CL])
    For our contribution to the Blizzard Challenge 2023, we improved on the system we submitted to the Blizzard Challenge 2021. Our approach entails a rule-based text-to-phoneme processing system that includes rule-based disambiguation of homographs in the French language. It then transforms the phonemes to spectrograms as intermediate representations using a fast and efficient non-autoregressive synthesis architecture based on Conformer and Glow. A GAN based neural vocoder that combines recent state-of-the-art approaches converts the spectrogram to the final wave. We carefully designed the data processing, training, and inference procedures for the challenge data. Our system identifier is G. Open source code and demo are available.
    Handshape recognition for Argentinian Sign Language using ProbSom. (arXiv:2310.17427v1 [cs.CV])
    Automatic sign language recognition is an important topic within the areas of human-computer interaction and machine learning. On the one hand, it poses a complex challenge that requires the intervention of various knowledge areas, such as video processing, image processing, intelligent systems and linguistics. On the other hand, robust recognition of sign language could assist in the translation process and the integration of hearing-impaired people. This paper offers two main contributions: first, the creation of a database of handshapes for the Argentinian Sign Language (LSA), which is a topic that has barely been discussed so far. Secondly, a technique for image processing, descriptor extraction and subsequent handshape classification using a supervised adaptation of self-organizing maps that is called ProbSom. This technique is compared to others in the state of the art, such as Support Vector Machines (SVM), Random Forests, and Neural Networks. The database that was built contains 800 images with 16 LSA handshapes, and is a first step towards building a comprehensive database of Argentinian signs. The ProbSom-based neural classifier, using the proposed descriptor, achieved an accuracy rate above 90%.
    Learning Regularized Graphon Mean-Field Games with Unknown Graphons. (arXiv:2310.17531v1 [cs.GT])
    We design and analyze reinforcement learning algorithms for Graphon Mean-Field Games (GMFGs). In contrast to previous works that require the precise values of the graphons, we aim to learn the Nash Equilibrium (NE) of the regularized GMFGs when the graphons are unknown. Our contributions are threefold. First, we propose the Proximal Policy Optimization for GMFG (GMFG-PPO) algorithm and show that it converges at a rate of $O(T^{-1/3})$ after $T$ iterations with an estimation oracle, improving on a previous work by Xie et al. (ICML, 2021). Second, using kernel embedding of distributions, we design efficient algorithms to estimate the transition kernels, reward functions, and graphons from sampled agents. Convergence rates are then derived when the positions of the agents are either known or unknown. Results for the combination of the optimization algorithm GMFG-PPO and the estimation algorithm are then provided. These algorithms are the first specifically designed for learning graphons from sampled agents. Finally, the efficacy of the proposed algorithms is corroborated through simulations. These simulations demonstrate that learning the unknown graphons reduces the exploitability effectively.
    Interactive Robot Learning from Verbal Correction. (arXiv:2310.17555v1 [cs.RO])
    The ability to learn and refine behavior after deployment has become ever more important for robots as we design them to operate in unstructured environments like households. In this work, we design a new learning system based on a large language model (LLM), OLAF, that allows everyday users to teach a robot using verbal corrections when the robot makes mistakes, e.g., by saying "Stop what you're doing. You should move closer to the cup." A key feature of OLAF is its ability to update the robot's visuomotor neural policy based on the verbal feedback to avoid repeating mistakes in the future. This is in contrast to existing LLM-based robotic systems, which only follow verbal commands or corrections but do not learn from them. We demonstrate the efficacy of our design in experiments where a user teaches a robot to perform long-horizon manipulation tasks both in simulation and on physical hardware, achieving on average 20.0% improvement in policy success rate. Videos and more results are at https://ut-austin-rpl.github.io/olaf/
    Invariance Measures for Neural Networks. (arXiv:2310.17404v1 [cs.LG])
    Invariances in neural networks are useful and necessary for many tasks. However, the representation of the invariance of most neural network models has not been characterized. We propose measures to quantify the invariance of neural networks in terms of their internal representation. The measures are efficient and interpretable, and can be applied to any neural network model. They are also more sensitive to invariance than previously defined measures. We validate the measures and their properties in the domain of affine transformations and the CIFAR10 and MNIST datasets, including their stability and interpretability. Using the measures, we perform a first analysis of CNN models and show that their internal invariance is remarkably stable to random weight initializations, but not to changes in dataset or transformation. We believe the measures will enable new avenues of research in invariance representation.
    The Expressive Power of Low-Rank Adaptation. (arXiv:2310.17513v1 [cs.LG])
    Low-Rank Adaptation (LoRA), a parameter-efficient fine-tuning method that leverages low-rank adaptation of weight matrices, has emerged as a prevalent technique for fine-tuning pre-trained models such as large language models and diffusion models. Despite its huge success in practice, the theoretical underpinnings of LoRA have largely remained unexplored. This paper takes the first step to bridge this gap by theoretically analyzing the expressive power of LoRA. We prove that, for fully connected neural networks, LoRA can adapt any model $f$ to accurately represent any smaller target model $\overline{f}$ if LoRA-rank $\geq(\text{width of }f) \times \frac{\text{depth of }\overline{f}}{\text{depth of }f}$. We also quantify the approximation error when LoRA-rank is lower than the threshold. For Transformer networks, we show any model can be adapted to a target model of the same size with rank-$(\frac{\text{embedding size}}{2})$ LoRA adapters.
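The rank thresholds stated above are simple arithmetic, and a small sketch makes them concrete. The function names and example sizes are hypothetical, and rounding up to an integer rank is an assumption added here because a LoRA rank must be a whole number. The parameter count uses the standard LoRA factorization of an update to a d x d weight into a d x r and an r x d matrix.

```python
def lora_rank_threshold_fnn(width, depth_target, depth_frozen):
    """Smallest integer rank satisfying rank >= width * depth_target / depth_frozen."""
    numerator = width * depth_target
    return -(-numerator // depth_frozen)  # integer ceiling division


def lora_rank_threshold_transformer(embedding_size):
    """Rank of embedding_size / 2, rounded up for odd sizes (rounding is an assumption)."""
    return -(-embedding_size // 2)


def lora_trainable_params(width, rank):
    """Trainable parameters for one width x width weight matrix at the given rank."""
    return 2 * width * rank


if __name__ == "__main__":
    # Hypothetical sizes: a width-512, depth-8 frozen network and a depth-2 target.
    rank = lora_rank_threshold_fnn(512, 2, 8)
    print(rank, lora_trainable_params(512, rank), 512 * 512)
    print(lora_rank_threshold_transformer(768))
```

With these example sizes, the threshold rank is a quarter of the width, so each adapted layer trains half as many parameters as the full weight matrix contains.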
    Coalitional Bargaining via Reinforcement Learning: An Application to Collaborative Vehicle Routing. (arXiv:2310.17458v1 [cs.LG])
    Collaborative Vehicle Routing is where delivery companies cooperate by sharing their delivery information and performing delivery requests on behalf of each other. This achieves economies of scale and thus reduces cost, greenhouse gas emissions, and road congestion. But which company should partner with whom, and how much should each company be compensated? Traditional game theoretic solution concepts, such as the Shapley value or nucleolus, are difficult to calculate for the real-world problem of Collaborative Vehicle Routing due to the characteristic function scaling exponentially with the number of agents. This would require solving the Vehicle Routing Problem (an NP-Hard problem) an exponential number of times. We therefore propose to model this problem as a coalitional bargaining game where - crucially - agents are not given access to the characteristic function. Instead, we implicitly reason about the characteristic function, and thus eliminate the need to evaluate the VRP an exponential number of times - we only need to evaluate it once. Our contribution is that our decentralised approach is both scalable and considers the self-interested nature of companies. The agents learn using a modified Independent Proximal Policy Optimisation. Our RL agents outperform a strong heuristic bot. The agents correctly identify the optimal coalitions 79% of the time with an average optimality gap of 4.2% and reduction in run-time of 62%.
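To make the scaling argument above concrete, here is a minimal sketch of the exact Shapley value computed from a characteristic function. The toy characteristic function in the usage example is hypothetical; in the routing setting each call to `v` would mean solving a Vehicle Routing Problem, which is why enumerating all 2^n coalitions is impractical. This illustrates the classical solution concept, not the paper's bargaining approach.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, v):
    """Exact Shapley values; v maps a frozenset of players to a coalition value.

    For each player, every subset of the other players is visited, so the
    characteristic function is needed on all 2^n coalitions.
    """
    n = len(players)
    phi = {p: 0.0 for p in players}
    for p in players:
        others = [q for q in players if q != p]
        for k in range(n):
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for combo in combinations(others, k):
                coalition = frozenset(combo)
                phi[p] += weight * (v(coalition | {p}) - v(coalition))
    return phi


def coalition_count(n):
    """Number of distinct coalitions (including the empty one) among n players."""
    return 2 ** n


if __name__ == "__main__":
    # Hypothetical symmetric game: a coalition of size s is worth s^2.
    players = ["A", "B", "C"]
    print(shapley_values(players, lambda s: float(len(s) ** 2)))
    print(coalition_count(20))
```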
    CBD: A Certified Backdoor Detector Based on Local Dominant Probability. (arXiv:2310.17498v1 [cs.LG])
    Backdoor attacks are a common threat to deep neural networks. During testing, samples embedded with a backdoor trigger will be misclassified as an adversarial target by a backdoored model, while samples without the backdoor trigger will be correctly classified. In this paper, we present the first certified backdoor detector (CBD), which is built on a novel, adjustable conformal prediction scheme that uses our proposed statistic, local dominant probability. For any classifier under inspection, CBD provides 1) a detection inference, 2) the condition under which the attacks are guaranteed to be detectable for the same classification domain, and 3) a probabilistic upper bound for the false positive rate. Our theoretical results show that attacks with triggers that are more resilient to test-time noise and have smaller perturbation magnitudes are more likely to be detected with guarantees. Moreover, we conduct extensive experiments on four benchmark datasets considering various backdoor types, such as BadNet, CB, and Blend. CBD achieves comparable or even higher detection accuracy than state-of-the-art detectors, and it additionally provides detection certification. Notably, for backdoor attacks with random perturbation triggers bounded by $\ell_2\leq0.75$ that achieve more than a 90\% attack success rate, CBD achieves 100\% (98\%), 100\% (84\%), 98\% (98\%), and 72\% (40\%) empirical (certified) detection true positive rates on the four benchmark datasets GTSRB, SVHN, CIFAR-10, and TinyImageNet, respectively, with low false positive rates.
    Hierarchical Ensemble-Based Feature Selection for Time Series Forecasting. (arXiv:2310.17544v1 [cs.LG])
    We study a novel ensemble approach for feature selection based on hierarchical stacking in cases of non-stationarity and a limited number of samples with a large number of features. Our approach exploits the co-dependency between features using a hierarchical structure. Initially, a machine learning model is trained using a subset of features, and then the model's output is updated using another algorithm with the remaining features to minimize the target loss. This hierarchical structure allows for flexible depth and feature selection. By exploiting feature co-dependency hierarchically, our proposed approach overcomes the limitations of traditional feature selection methods and feature importance scores. The effectiveness of the approach is demonstrated on synthetic and real-life datasets, indicating improved performance with scalability and stability compared to traditional methods and state-of-the-art approaches.
    A Challenge in Reweighting Data with Bilevel Optimization. (arXiv:2310.17386v1 [stat.ML])
    In many scenarios, one uses a large training set to train a model with the goal of performing well on a smaller testing set with a different distribution. Learning a weight for each data point of the training set is an appealing solution, as it ideally allows one to automatically learn the importance of each training point for generalization on the testing set. This task is usually formalized as a bilevel optimization problem. Classical bilevel solvers are based on a warm-start strategy where both the parameters of the models and the data weights are learned at the same time. We show that this joint dynamic may lead to sub-optimal solutions, for which the final data weights are very sparse. This finding illustrates the difficulty of data reweighting and offers a clue as to why this method is rarely used in practice.
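    For reference, the bilevel formulation mentioned above is commonly written as below. The notation ($w$ for the data weights, $\theta$ for the model parameters, $\ell$ for the per-sample loss, $n$ training and $m$ testing points) is assumed here for illustration and is not taken from the paper.

```latex
\min_{w \in \mathbb{R}_{+}^{n}} \;
  \frac{1}{m} \sum_{j=1}^{m}
  \ell\bigl(\theta^{*}(w);\, x_j^{\mathrm{test}}, y_j^{\mathrm{test}}\bigr)
\quad \text{s.t.} \quad
\theta^{*}(w) \in \arg\min_{\theta} \;
  \sum_{i=1}^{n} w_i \, \ell\bigl(\theta;\, x_i^{\mathrm{train}}, y_i^{\mathrm{train}}\bigr)
```

    A warm-start solver updates $\theta$ and $w$ in alternation instead of solving the inner problem to convergence for each $w$; this is the joint dynamic the abstract refers to.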
    Foundation Model Based Native AI Framework in 6G with Cloud-Edge-End Collaboration. (arXiv:2310.17471v1 [cs.IT])
    Future wireless communication networks are in a position to move beyond data-centric, device-oriented connectivity and offer intelligent, immersive experiences based on task-oriented connections, especially in the context of the thriving development of pre-trained foundation models (PFM) and the evolving vision of 6G native artificial intelligence (AI). Therefore, redefining modes of collaboration between devices and servers and constructing native intelligence libraries become critically important in 6G. In this paper, we analyze the challenges of achieving 6G native AI from the perspectives of data, intelligence, and networks. Then, we propose a 6G native AI framework based on foundation models, provide a customization approach for intent-aware PFM, present a construction of a task-oriented AI toolkit, and outline a novel cloud-edge-end collaboration paradigm. As a practical use case, we apply this framework for orchestration, achieving the maximum sum rate within a wireless communication system, and presenting preliminary evaluation results. Finally, we outline research directions for achieving native AI in 6G.
    Fair collaborative vehicle routing: A deep multi-agent reinforcement learning approach. (arXiv:2310.17485v1 [cs.LG])
    Collaborative vehicle routing occurs when carriers collaborate through sharing their transportation requests and performing transportation requests on behalf of each other. This achieves economies of scale, thus reducing cost, greenhouse gas emissions and road congestion. But which carrier should partner with whom, and how much should each carrier be compensated? Traditional game theoretic solution concepts are expensive to calculate as the characteristic function scales exponentially with the number of agents. This would require solving the vehicle routing problem (NP-hard) an exponential number of times. We therefore propose to model this problem as a coalitional bargaining game solved using deep multi-agent reinforcement learning, where - crucially - agents are not given access to the characteristic function. Instead, we implicitly reason about the characteristic function; thus, when deployed in production, we only need to evaluate the expensive post-collaboration vehicle routing problem once. Our contribution is that we are the first to consider both the route allocation problem and gain sharing problem simultaneously - without access to the expensive characteristic function. Through decentralised machine learning, our agents bargain with each other and agree to outcomes that correlate well with the Shapley value - a fair profit allocation mechanism. Importantly, we are able to achieve a reduction in run-time of 88%.
    Benign Oscillation of Stochastic Gradient Descent with Large Learning Rates. (arXiv:2310.17074v1 [cs.LG])
    In this work, we theoretically investigate the generalization properties of neural networks (NN) trained by the stochastic gradient descent (SGD) algorithm with large learning rates. Under such a training regime, we find that the oscillation of the NN weights caused by large learning rate SGD training turns out to be beneficial to the generalization of the NN, potentially improving over the same NN trained by SGD with small learning rates, which converges more smoothly. In view of this finding, we call such a phenomenon "benign oscillation". Our theory towards demystifying such a phenomenon builds upon the feature learning perspective of deep learning. Specifically, we consider a feature-noise data generation model that consists of (i) weak features which have a small $\ell_2$-norm and appear in each data point; (ii) strong features which have a larger $\ell_2$-norm but only appear in a certain fraction of all data points; and (iii) noise. We prove that NNs trained by oscillating SGD with a large learning rate can effectively learn the weak features in the presence of those strong features. In contrast, NNs trained by SGD with a small learning rate can only learn the strong features but make little progress in learning the weak features. Consequently, when it comes to new testing data which consist of only weak features, the NN trained by oscillating SGD with a large learning rate could still make correct predictions consistently, while the NN trained by small learning rate SGD fails. Our theory sheds light on how large learning rate training benefits the generalization of NNs. Experimental results demonstrate our finding on "benign oscillation".
    Demonstration-Regularized RL. (arXiv:2310.17303v1 [stat.ML])
    Incorporating expert demonstrations has empirically helped to improve the sample efficiency of reinforcement learning (RL). This paper quantifies theoretically to what extent this extra information reduces RL's sample complexity. In particular, we study demonstration-regularized reinforcement learning, which leverages the expert demonstrations by KL-regularization for a policy learned by behavior cloning. Our findings reveal that using $N^{\mathrm{E}}$ expert demonstrations enables the identification of an optimal policy at a sample complexity of order $\widetilde{\mathcal{O}}(\mathrm{Poly}(S,A,H)/(\varepsilon^2 N^{\mathrm{E}}))$ in finite and $\widetilde{\mathcal{O}}(\mathrm{Poly}(d,H)/(\varepsilon^2 N^{\mathrm{E}}))$ in linear Markov decision processes, where $\varepsilon$ is the target precision, $H$ the horizon, $A$ the number of actions, $S$ the number of states in the finite case and $d$ the dimension of the feature space in the linear case. As a by-product, we provide tight convergence guarantees for the behavior cloning procedure under general assumptions on the policy classes. Additionally, we establish that demonstration-regularized methods are provably efficient for reinforcement learning from human feedback (RLHF). In this respect, we provide theoretical evidence showing the benefits of KL-regularization for RLHF in tabular and linear MDPs. Interestingly, we avoid pessimism injection by employing computationally feasible regularization to handle reward estimation uncertainty, thus setting our approach apart from prior works.
    Weakly-Supervised Surgical Phase Recognition. (arXiv:2310.17209v1 [cs.CV])
    A key element of computer-assisted surgery systems is phase recognition of surgical videos. Existing phase recognition algorithms require frame-wise annotation of a large number of videos, which is time-consuming and costly. In this work we join concepts of graph segmentation with self-supervised learning to derive a random-walk solution for per-frame phase prediction. Furthermore, we utilize within our method two forms of weak supervision: sparse timestamps or few-shot learning. The proposed algorithm enjoys low complexity and can operate in low-data regimes. We validate our method by running experiments with the public Cholec80 dataset of laparoscopic cholecystectomy videos, demonstrating promising performance in multiple setups.
    On Forecast Stability. (arXiv:2310.17332v1 [cs.LG])
    Forecasts are typically not produced in a vacuum but in a business context, where forecasts are generated on a regular basis and interact with each other. For decisions, it may be important that forecasts do not change arbitrarily, and are stable in some sense. However, this area has received only limited attention in the forecasting literature. In this paper, we explore two types of forecast stability that we call vertical stability and horizontal stability. The existing works in the literature are only applicable to certain base models and extending these frameworks to be compatible with any base model is not straightforward. Furthermore, these frameworks can only stabilise the forecasts vertically. To fill this gap, we propose a simple linear-interpolation-based approach that is applicable to stabilise the forecasts provided by any base model vertically and horizontally. The approach can produce both accurate and stable forecasts. Using N-BEATS, Pooled Regression and LightGBM as the base models, in our evaluation on four publicly available datasets, the proposed framework is able to achieve significantly higher stability and/or accuracy compared to a set of benchmarks including a state-of-the-art forecast stabilisation method across three error metrics and six stability metrics.
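    The general idea of stabilising by linear interpolation can be sketched as follows. This is an illustrative assumption about the form such an interpolation might take (a convex combination of an earlier forecast and a newly produced one for the same target periods), not the paper's exact procedure; the function name and the `weight` parameter are hypothetical.

```python
def stabilise(previous_forecast, new_forecast, weight):
    """Convex combination of two forecasts for the same target periods.

    weight = 0 keeps the new forecast unchanged (no stabilisation);
    weight = 1 repeats the previous forecast (maximal stability).
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    if len(previous_forecast) != len(new_forecast):
        raise ValueError("forecasts must cover the same target periods")
    return [weight * p + (1.0 - weight) * n
            for p, n in zip(previous_forecast, new_forecast)]


# Forecasts for the same two periods, made at consecutive forecast origins.
stable = stabilise([10.0, 20.0], [12.0, 16.0], weight=0.5)
```

    Because the combination only needs the forecasts themselves, a scheme of this kind works with any base model, which matches the model-agnostic claim in the abstract.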
    fairret: a Framework for Differentiable Fairness Regularization Terms. (arXiv:2310.17256v1 [cs.LG])
    Current tools for machine learning fairness only admit a limited range of fairness definitions and have seen little integration with automatic differentiation libraries, despite the central role these libraries play in modern machine learning pipelines. We introduce a framework of fairness regularization terms (fairrets) which quantify bias as modular objectives that are easily integrated in automatic differentiation pipelines. By employing a general definition of fairness in terms of linear-fractional statistics, a wide class of fairrets can be computed efficiently. Experiments show the behavior of their gradients and their utility in enforcing fairness with minimal loss of predictive power compared to baselines. Our contribution includes a PyTorch implementation of the fairret framework.
    Improving Denoising Diffusion Models via Simultaneous Estimation of Image and Noise. (arXiv:2310.17167v1 [cs.LG])
    This paper introduces two key contributions aimed at improving the speed and quality of images generated through inverse diffusion processes. The first contribution involves reparameterizing the diffusion process in terms of the angle on a quarter-circular arc between the image and noise, specifically setting the conventional $\displaystyle \sqrt{\bar{\alpha}}=\cos(\eta)$. This reparameterization eliminates two singularities and allows for the expression of diffusion evolution as a well-behaved ordinary differential equation (ODE). In turn, this allows higher order ODE solvers such as Runge-Kutta methods to be used effectively. The second contribution is to directly estimate both the image ($\mathbf{x}_0$) and noise ($\mathbf{\epsilon}$) using our network, which enables more stable calculations of the update step in the inverse diffusion steps, as accurate estimation of both the image and noise is crucial at different stages of the process. Together with these changes, our model achieves faster generation, with the ability to converge on high-quality images more quickly, and higher quality of the generated images, as measured by metrics such as Frechet Inception Distance (FID), spatial Frechet Inception Distance (sFID), precision, and recall.
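    Assuming the standard forward diffusion $\mathbf{x}_t = \sqrt{\bar{\alpha}}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}}\,\mathbf{\epsilon}$ (an assumption here, since the abstract does not spell it out), the substitution $\sqrt{\bar{\alpha}}=\cos(\eta)$ gives the following form. The derivative line is a direct consequence for fixed $\mathbf{x}_0$ and $\mathbf{\epsilon}$, written out only as a sketch of why an ODE in $\eta$ is natural.

```latex
\mathbf{x}(\eta) = \cos(\eta)\,\mathbf{x}_0 + \sin(\eta)\,\mathbf{\epsilon},
\qquad \eta \in \bigl[0, \tfrac{\pi}{2}\bigr],
\qquad \text{since } \sqrt{1-\bar{\alpha}} = \sqrt{1-\cos^2(\eta)} = \sin(\eta)

\frac{d\mathbf{x}}{d\eta} = -\sin(\eta)\,\mathbf{x}_0 + \cos(\eta)\,\mathbf{\epsilon}
```

    At $\eta = 0$ the sample is the clean image and at $\eta = \pi/2$ it is pure noise, which is the quarter-circular arc the abstract describes.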
    C-Disentanglement: Discovering Causally-Independent Generative Factors under an Inductive Bias of Confounder. (arXiv:2310.17325v1 [cs.LG])
    Representation learning assumes that real-world data is generated by a few semantically meaningful generative factors (i.e., sources of variation) and aims to discover them in the latent space. These factors are expected to be causally disentangled, meaning that distinct factors are encoded into separate latent variables, and changes in one factor will not affect the values of the others. Compared to statistical independence, causal disentanglement allows more controllable data generation, improved robustness, and better generalization. However, most existing work assumes unconfoundedness in the discovery process, i.e., that the generative factors have no common causes, and thus obtains only statistical independence. In this paper, we recognize the importance of modeling confounders in discovering causal generative factors. Unfortunately, such factors are not identifiable without proper inductive bias. We fill the gap by introducing a framework entitled Confounded-Disentanglement (C-Disentanglement), the first framework that explicitly introduces the inductive bias of confounder via labels from domain expertise. In addition, we accordingly propose an approach to sufficiently identify the causally disentangled factors under any inductive bias of the confounder. We conduct extensive experiments on both synthetic and real-world datasets. Our method demonstrates competitive results compared to various SOTA baselines in obtaining causally disentangled features and downstream tasks under domain shifts.
    CQM: Curriculum Reinforcement Learning with a Quantized World Model. (arXiv:2310.17330v1 [cs.LG])
    Recent curriculum Reinforcement Learning (RL) has shown notable progress in solving complex tasks by proposing sequences of surrogate tasks. However, previous approaches often face challenges when they generate curriculum goals in a high-dimensional space. Thus, they usually rely on manually specified goal spaces. To alleviate this limitation and improve the scalability of the curriculum, we propose a novel curriculum method that automatically defines the semantic goal space, which contains vital information for the curriculum process, and suggests curriculum goals over it. To define the semantic goal space, our method discretizes continuous observations via vector quantized-variational autoencoders (VQ-VAE) and restores the temporal relations between the discretized observations by a graph. Concurrently, our method suggests uncertainty and temporal distance-aware curriculum goals that converge to the final goals over the automatically composed goal space. We demonstrate that the proposed method allows efficient exploration in an uninformed environment with raw goal examples only. Our method also outperforms the state-of-the-art curriculum RL methods on data efficiency and performance in various goal-reaching tasks, even with ego-centric visual inputs.
    How do Language Models Bind Entities in Context? (arXiv:2310.17191v1 [cs.LG])
    To correctly use in-context information, language models (LMs) must bind entities to their attributes. For example, given a context describing a "green square" and a "blue circle", LMs must bind the shapes to their respective colors. We analyze LM representations and identify the binding ID mechanism: a general mechanism for solving the binding problem, which we observe in every sufficiently large model from the Pythia and LLaMA families. Using causal interventions, we show that LMs' internal activations represent binding information by attaching binding ID vectors to corresponding entities and attributes. We further show that binding ID vectors form a continuous subspace, in which distances between binding ID vectors reflect their discernability. Overall, our results uncover interpretable strategies in LMs for representing symbolic knowledge in-context, providing a step towards understanding general in-context reasoning in large-scale LMs.
    A multi-artifact EEG denoising by frequency-based deep learning. (arXiv:2310.17335v1 [cs.LG])
    Electroencephalographic (EEG) signals are fundamental to neuroscience research and clinical applications such as brain-computer interfaces and neurological disorder diagnosis. These signals are typically a combination of neurological activity and noise, originating from various sources, including physiological artifacts like ocular and muscular movements. Under this setting, we tackle the challenge of distinguishing neurological activity from noise-related sources. We develop a novel EEG denoising model that operates in the frequency domain, leveraging prior knowledge about noise spectral features to adaptively compute optimal convolutional filters for noise separation. The model is trained to learn an empirical relationship connecting the spectral characteristics of noise and noisy signal to a non-linear transformation which allows signal denoising. Performance evaluation on the EEGdenoiseNet dataset shows that the proposed model achieves optimal results according to both temporal and spectral metrics. The model is found to remove physiological artifacts from input EEG data, thus achieving effective EEG denoising. Indeed, the model performance either matches or outperforms that achieved by benchmark models, proving to effectively remove both muscle and ocular artifacts without the need to perform any training on the particular type of artifact.
    DPM-Solver-v3: Improved Diffusion ODE Solver with Empirical Model Statistics. (arXiv:2310.13268v2 [cs.CV] UPDATED)
    Diffusion probabilistic models (DPMs) have exhibited excellent performance for high-fidelity image generation while suffering from inefficient sampling. Recent works accelerate the sampling procedure by proposing fast ODE solvers that leverage the specific ODE form of DPMs. However, they rely heavily on a specific parameterization during inference (such as noise/data prediction), which might not be the optimal choice. In this work, we propose a novel formulation towards the optimal parameterization during sampling that minimizes the first-order discretization error of the ODE solution. Based on such formulation, we propose DPM-Solver-v3, a new fast ODE solver for DPMs by introducing several coefficients efficiently computed on the pretrained model, which we call empirical model statistics. We further incorporate multistep methods and a predictor-corrector framework, and propose some techniques for improving sample quality at small numbers of function evaluations (NFE) or large guidance scales. Experiments show that DPM-Solver-v3 achieves consistently better or comparable performance in both unconditional and conditional sampling with both pixel-space and latent-space DPMs, especially in 5$\sim$10 NFEs. We achieve FIDs of 12.21 (5 NFE), 2.51 (10 NFE) on unconditional CIFAR10, and MSE of 0.55 (5 NFE, 7.5 guidance scale) on Stable Diffusion, bringing a speed-up of 15\%$\sim$30\% compared to previous state-of-the-art training-free methods. Code is available at https://github.com/thu-ml/DPM-Solver-v3.
    Graphical Object-Centric Actor-Critic. (arXiv:2310.17178v1 [cs.AI])
    There have recently been significant advances in the problem of unsupervised object-centric representation learning and its application to downstream tasks. The latest works support the argument that employing disentangled object representations in image-based object-centric reinforcement learning tasks facilitates policy learning. We propose a novel object-centric reinforcement learning algorithm combining actor-critic and model-based approaches to utilize these representations effectively. In our approach, we use a transformer encoder to extract object representations and graph neural networks to approximate the dynamics of an environment. The proposed method fills a research gap in developing efficient object-centric world models for reinforcement learning settings that can be used for environments with discrete or continuous action spaces. Our algorithm performs better in a visually complex 3D robotic environment and a 2D environment with compositional structure than the state-of-the-art model-free actor-critic algorithm built upon transformer architecture and the state-of-the-art monolithic model-based algorithm.
    CROP: Conservative Reward for Model-based Offline Policy Optimization. (arXiv:2310.17245v1 [cs.LG])
    Offline reinforcement learning (RL) aims to optimize a policy using collected data without online interactions. Model-based approaches are particularly appealing for addressing offline RL challenges due to their capability to mitigate the limitations of offline data through data generation using models. Prior research has demonstrated that introducing conservatism into the model or Q-function during policy optimization can effectively alleviate the prevalent distribution drift problem in offline RL. However, the investigation into the impacts of conservatism in reward estimation is still lacking. This paper proposes a novel model-based offline RL algorithm, Conservative Reward for model-based Offline Policy optimization (CROP), which conservatively estimates the reward in model training. To achieve a conservative reward estimation, CROP simultaneously minimizes the estimation error and the reward of random actions. Theoretical analysis shows that this conservative reward mechanism leads to a conservative policy evaluation and helps mitigate distribution drift. Experiments on D4RL benchmarks show that the performance of CROP is comparable to the state-of-the-art baselines. Notably, CROP establishes an innovative connection between offline and online RL, highlighting that offline RL problems can be tackled by applying online RL techniques to the empirical Markov decision process trained with a conservative reward. The source code is available at https://github.com/G0K0URURI/CROP.git.
    Multi-scale Diffusion Denoised Smoothing. (arXiv:2310.16779v2 [cs.LG] UPDATED)
    Along with recent diffusion models, randomized smoothing has become one of a few tangible approaches that offer adversarial robustness to models at scale, e.g., large pre-trained models. Specifically, one can perform randomized smoothing on any classifier via a simple "denoise-and-classify" pipeline, so-called denoised smoothing, given that an accurate denoiser, such as a diffusion model, is available. In this paper, we present scalable methods to address the current trade-off between certified robustness and accuracy in denoised smoothing. Our key idea is to "selectively" apply smoothing among multiple noise scales, coined multi-scale smoothing, which can be efficiently implemented with a single diffusion model. This approach also suggests a new objective to compare the collective robustness of multi-scale smoothed classifiers, and raises the question of which representation of the diffusion model would maximize the objective. To address this, we propose to further fine-tune the diffusion model (a) to perform consistent denoising whenever the original image is recoverable, but (b) to generate rather diverse outputs otherwise. Our experiments show that the proposed multi-scale smoothing scheme combined with diffusion fine-tuning enables strong certified robustness at high noise levels while maintaining accuracy close to that of non-smoothed classifiers.
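    As background, the prediction step of plain randomized smoothing can be sketched as a majority vote over Gaussian-perturbed copies of the input. This is a simplified, assumed illustration on a scalar input: it omits the denoiser, the certification step, and the multi-scale selection that the paper proposes, and the names used are hypothetical.

```python
import random
from collections import Counter


def smoothed_predict(classifier, x, sigma, n_samples, rng):
    """Majority vote of the base classifier over Gaussian-perturbed inputs."""
    votes = Counter(
        classifier(x + rng.gauss(0.0, sigma)) for _ in range(n_samples)
    )
    return votes.most_common(1)[0][0]


def sign_classifier(x):
    """Toy base classifier on a scalar input: class 1 if x >= 0, else class 0."""
    return 1 if x >= 0.0 else 0


prediction = smoothed_predict(
    sign_classifier, x=2.0, sigma=0.5, n_samples=200, rng=random.Random(0)
)
```

    In denoised smoothing, the noisy input would first pass through a denoiser before reaching the classifier. A larger `sigma` typically permits a larger certified radius but costs accuracy, which is the trade-off the abstract addresses.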
    Network Design through Graph Neural Networks: Identifying Challenges and Improving Performance. (arXiv:2310.17100v1 [cs.LG])
    Graph Neural Network (GNN) research has produced strategies to modify a graph's edges using gradients from a trained GNN, with the goal of network design. However, the factors which govern gradient-based editing are understudied, obscuring why edges are chosen and if edits are grounded in an edge's importance. Thus, we begin by analyzing the gradient computation in previous works, elucidating the factors that influence edits and highlighting the potential over-reliance on structural properties. Specifically, we find that edges can achieve high gradients due to structural biases, rather than importance, leading to erroneous edits when the factors are unrelated to the design task. To improve editing, we propose ORE, an iterative editing method that (a) edits the highest scoring edges and (b) re-embeds the edited graph to refresh gradients, leading to less biased edge choices. We empirically study ORE through a set of proposed design tasks, each with an external validation method, demonstrating that ORE improves upon previous methods by up to 50%.
    Technical Note: Feasibility of translating 3.0T-trained Deep-Learning Segmentation Models Out-of-the-Box on Low-Field MRI 0.55T Knee-MRI of Healthy Controls. (arXiv:2310.17152v1 [cs.CV])
    In the current study, our purpose is to evaluate the feasibility of applying deep learning (DL) enabled algorithms to quantify bilateral knee biomarkers in healthy controls scanned at 0.55T and compared with 3.0T. The current study assesses the performance of standard in-practice bone and cartilage segmentation algorithms at 0.55T, both qualitatively and quantitatively, in terms of comparing segmentation performance, areas of improvement, and compartment-wise cartilage thickness values between 0.55T and 3.0T. Initial results demonstrate usable to good technical feasibility of translating existing quantitative deep-learning-based image segmentation techniques, trained on 3.0T, out-of-the-box to 0.55T knee MRI in a multi-vendor acquisition environment. Especially in terms of segmenting cartilage compartments, the models perform almost equivalently to 3.0T in terms of Likert ranking. As demonstrated, sustainable and easy-to-install 0.55T low-field MRI can thus be used to evaluate knee cartilage thickness and bone segmentations, aided out-of-the-box by established DL algorithms trained at higher field strengths. This could be utilized at far-spread point-of-care locations that lack radiologists available to manually segment low-field images, at least until a sufficient pool of low-field data is collated. With further fine-tuning using manual labeling of low-field data or synthesized higher-SNR images from low-field images, OA biomarker quantification performance could potentially be improved further.
    PID-Inspired Inductive Biases for Deep Reinforcement Learning in Partially Observable Control Tasks. (arXiv:2307.05891v2 [cs.LG] UPDATED)
    Deep reinforcement learning (RL) has shown immense potential for learning to control systems through data alone. However, one challenge deep RL faces is that the full state of the system is often not observable. When this is the case, the policy needs to leverage the history of observations to infer the current state. At the same time, differences between the training and testing environments make it critical for the policy not to overfit to the sequence of observations it sees at training time. As such, there is an important balancing act between having the history encoder be flexible enough to extract relevant information, yet be robust to changes in the environment. To strike this balance, we look to the PID controller for inspiration. We assert the PID controller's success shows that only summing and differencing are needed to accumulate information over time for many control tasks. Following this principle, we propose two architectures for encoding history: one that directly uses PID features and another that extends these core ideas and can be used in arbitrary control tasks. When compared with prior approaches, our encoders produce policies that are often more robust and achieve better performance on a variety of tracking tasks. Going beyond tracking tasks, our policies achieve 1.7x better performance on average over previous state-of-the-art methods on a suite of locomotion control tasks.
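    The "summing and differencing" principle can be illustrated with the three classical PID terms computed from a history of tracking errors. This is a minimal sketch of textbook PID features under an assumed fixed time step `dt`; it is not the paper's encoder architecture, and the function name is hypothetical.

```python
def pid_features(errors, dt=1.0):
    """Return one (proportional, integral, derivative) tuple per time step.

    proportional: the current error
    integral:     running sum of errors times dt (summing)
    derivative:   change in error divided by dt (differencing)
    """
    features = []
    integral = 0.0
    previous = None
    for e in errors:
        integral += e * dt
        derivative = 0.0 if previous is None else (e - previous) / dt
        features.append((e, integral, derivative))
        previous = e
    return features


# Tracking errors (target minus observation) over three steps.
feats = pid_features([1.0, 2.0, 4.0])
```

    A policy fed these three numbers sees a fixed-size summary of the whole error history, which is far less flexible than a recurrent encoder and therefore harder to overfit to the training sequences.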
    TabR: Tabular Deep Learning Meets Nearest Neighbors in 2023. (arXiv:2307.14338v2 [cs.LG] UPDATED)
    Deep learning (DL) models for tabular data problems (e.g. classification, regression) are currently receiving increasingly more attention from researchers. However, despite the recent efforts, the non-DL algorithms based on gradient-boosted decision trees (GBDT) remain a strong go-to solution for these problems. One of the research directions aimed at improving the position of tabular DL involves designing so-called retrieval-augmented models. For a target object, such models retrieve other objects (e.g. the nearest neighbors) from the available training data and use their features and labels to make a better prediction. In this work, we present TabR -- essentially, a feed-forward network with a custom k-Nearest-Neighbors-like component in the middle. On a set of public benchmarks with datasets up to several million objects, TabR marks a big step forward for tabular DL: it demonstrates the best average performance among tabular DL models, becomes the new state-of-the-art on several datasets, and even outperforms GBDT models on the recently proposed "GBDT-friendly" benchmark (see Figure 1). Among the important findings and technical details powering TabR, the main ones lie in the attention-like mechanism that is responsible for retrieving the nearest neighbors and extracting valuable signal from them. In addition to the much higher performance, TabR is simple and significantly more efficient compared to prior retrieval-based tabular DL models.
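    The retrieval idea described above (look up nearby training objects and use their labels) can be sketched with a plain k-nearest-neighbours regressor. This is only a baseline illustration under assumed toy data: TabR itself uses a learned, attention-like retrieval component inside a feed-forward network, which this sketch does not reproduce.

```python
import math


def knn_predict(train_x, train_y, query, k):
    """Average the labels of the k training points closest to the query."""
    ranked = sorted(
        range(len(train_x)), key=lambda i: math.dist(train_x[i], query)
    )
    neighbours = ranked[:k]
    return sum(train_y[i] for i in neighbours) / k


# Hypothetical toy training set with two numeric features per object.
train_x = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]
train_y = [1.0, 3.0, 10.0]
pred = knn_predict(train_x, train_y, query=(0.5, 0.0), k=2)
```

    Here the far-away object at (5, 5) is ignored and the prediction is the mean of the two nearest labels. A retrieval-augmented network replaces the fixed distance and the plain average with learned counterparts.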
    DSAC-C: Constrained Maximum Entropy for Robust Discrete Soft-Actor Critic. (arXiv:2310.17173v1 [cs.LG])
    We present a novel extension to the family of Soft Actor-Critic (SAC) algorithms. We argue that based on the Maximum Entropy Principle, discrete SAC can be further improved via additional statistical constraints derived from a surrogate critic policy. Furthermore, our findings suggest that these constraints provide an added robustness against potential domain shifts, which are essential for safe deployment of reinforcement learning agents in the real world. We provide theoretical analysis and show empirical results in low-data regimes for both in-distribution and out-of-distribution variants of Atari 2600 games.
    miditok: A Python package for MIDI file tokenization. (arXiv:2310.17202v1 [cs.LG])
    Recent progress in natural language processing has been adapted to the symbolic music modality. Language models, such as Transformers, have been used with symbolic music for a variety of tasks, including music generation, modeling, and transcription, with state-of-the-art performance. These models are beginning to be used in production products. To encode and decode music for the backbone model, they need to rely on tokenizers, whose role is to serialize music into sequences of distinct elements called tokens. MidiTok is an open-source library for tokenizing symbolic music with great flexibility and extended features. It features the most popular music tokenizations under a unified API. It is designed to be easy to use and extend.
    A Neural Collapse Perspective on Feature Evolution in Graph Neural Networks. (arXiv:2307.01951v2 [cs.LG] UPDATED)
    Graph neural networks (GNNs) have become increasingly popular for classification tasks on graph-structured data. Yet, the interplay between graph topology and feature evolution in GNNs is not well understood. In this paper, we focus on node-wise classification, illustrated with community detection on stochastic block model graphs, and explore the feature evolution through the lens of the "Neural Collapse" (NC) phenomenon. When training instance-wise deep classifiers (e.g. for image classification) beyond the zero training error point, NC demonstrates a reduction in the deepest features' within-class variability and an increased alignment of their class means to certain symmetric structures. We start with an empirical study that shows that a decrease in within-class variability is also prevalent in the node-wise classification setting, however, not to the extent observed in the instance-wise case. Then, we theoretically study this distinction. Specifically, we show that even an "optimistic" mathematical model requires that the graphs obey a strict structural condition in order to possess a minimizer with exact collapse. Interestingly, this condition is viable also for heterophilic graphs and relates to recent empirical studies on settings with improved GNNs' generalization. Furthermore, by studying the gradient dynamics of the theoretical model, we provide reasoning for the partial collapse observed empirically. Finally, we present a study on the evolution of within- and between-class feature variability across layers of a well-trained GNN and contrast the behavior with spectral methods.
    HCT: Hybrid Convnet-Transformer for Parkinson's disease detection and severity prediction from gait. (arXiv:2310.17078v1 [cs.CV])
    In this paper, we propose a novel deep learning method based on a new Hybrid ConvNet-Transformer architecture to detect and stage Parkinson's disease (PD) from gait data. We adopt a two-step approach by dividing the problem into two sub-problems. Our Hybrid ConvNet-Transformer model first distinguishes healthy versus parkinsonian patients. If the patient is parkinsonian, a multi-class Hybrid ConvNet-Transformer model determines the Hoehn and Yahr (H&Y) score to assess the PD severity stage. Our hybrid architecture exploits the strengths of both Convolutional Neural Networks (ConvNets) and Transformers to accurately detect PD and determine the severity stage. In particular, we take advantage of ConvNets to capture local patterns and correlations in the data, while we exploit Transformers for handling long-term dependencies in the input signal. We show that our hybrid method achieves superior performance when compared to other state-of-the-art methods, with a PD detection accuracy of 97% and a severity staging accuracy of 87%. Our source code is available at: https://github.com/SafwenNaimi
    Leveraging Ensemble Diversity for Robust Self-Training in the Presence of Sample Selection Bias. (arXiv:2310.14814v2 [cs.LG] UPDATED)
    Self-training is a well-known approach for semi-supervised learning. It consists of iteratively assigning pseudo-labels to unlabeled data for which the model is confident and treating them as labeled examples. For neural networks, softmax prediction probabilities are often used as a confidence measure, despite the fact that they are known to be overconfident, even for wrong predictions. This phenomenon is particularly intensified in the presence of sample selection bias, i.e., when data labeling is subject to some constraint. To address this issue, we propose a novel confidence measure, called $\mathcal{T}$-similarity, built upon the prediction diversity of an ensemble of linear classifiers. We provide the theoretical analysis of our approach by studying stationary points and describing the relationship between the diversity of the individual members and their performance. We empirically demonstrate the benefit of our confidence measure for three different pseudo-labeling policies on classification datasets of various data modalities.
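    The baseline pseudo-labeling step described above (keep an unlabeled example only when the top softmax probability is high enough) can be sketched as follows. This shows the standard softmax-confidence policy, not the proposed $\mathcal{T}$-similarity measure; the function names, the 0.9 threshold, and the sample logits are assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def select_pseudo_labels(logits_per_example, threshold=0.9):
    """Return (index, label) pairs for examples whose top softmax
    probability reaches `threshold` (a hypothetical cut-off)."""
    selected = []
    for i, logits in enumerate(logits_per_example):
        probs = softmax(logits)
        confidence = max(probs)
        if confidence >= threshold:
            selected.append((i, probs.index(confidence)))
    return selected


# Hypothetical logits for three unlabeled examples.
unlabeled_logits = [[4.0, 0.0, 0.0], [0.2, 0.1, 0.0], [0.0, 5.0, 0.0]]
print(select_pseudo_labels(unlabeled_logits))
```

    Because the selection depends only on the softmax output, an overconfident model will admit wrong pseudo-labels, which is the weakness the abstract points to.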
    Joint Entity and Relation Extraction with Span Pruning and Hypergraph Neural Networks. (arXiv:2310.17238v1 [cs.CL])
    Entity and Relation Extraction (ERE) is an important task in information extraction. Recent marker-based pipeline models achieve state-of-the-art performance, but still suffer from the error propagation issue. Also, most current ERE models do not take into account higher-order interactions between multiple entities and relations, while higher-order modeling could be beneficial. In this work, we propose HyperGraph neural network for ERE ($\hgnn{}$), which is built upon the PL-marker (a state-of-the-art marker-based pipeline model). To alleviate error propagation, we use a high-recall pruner mechanism to transfer the burden of entity identification and labeling from the NER module to the joint module of our model. For higher-order modeling, we build a hypergraph, where nodes are entities (provided by the span pruner) and relations thereof, and hyperedges encode interactions between two different relations or between a relation and its associated subject and object entities. We then run a hypergraph neural network for higher-order inference by applying message passing over the built hypergraph. Experiments on three widely used benchmarks (\acef{}, \ace{} and \scierc{}) for the ERE task show significant improvements over the previous state-of-the-art PL-marker.
    Deep machine learning for meteor monitoring: advances with transfer learning and gradient-weighted class activation mapping. (arXiv:2310.16826v2 [astro-ph.EP] UPDATED)
    In recent decades, the use of optical detection systems for meteor studies has increased dramatically, resulting in huge amounts of data being analyzed. Automated meteor detection tools are essential for studying the continuous meteoroid incoming flux, recovering fresh meteorites, and achieving a better understanding of our Solar System. Concerning meteor detection, distinguishing false positives between meteor and non-meteor images has traditionally been performed by hand, which is very time-consuming. To address this issue, we developed a fully automated pipeline that uses Convolutional Neural Networks (CNNs) to classify candidate meteor detections. Our new method is able to detect meteors even in images that contain static elements such as clouds, the Moon, and buildings. To accurately locate the meteor within each frame, we employ the Gradient-weighted Class Activation Mapping (Grad-CAM) technique. This method facilitates the identification of the region of interest by multiplying the activations from the last convolutional layer with the average of the gradients across the feature map of that layer. By combining these findings with the activation map derived from the first convolutional layer, we effectively pinpoint the most probable pixel location of the meteor. We trained and evaluated our model on a large dataset collected by the Spanish Meteor Network (SPMN) and achieved a precision of 98%. The methodology presented here has the potential to reduce the workload of meteor scientists and station operators and improve the accuracy of meteor tracking and classification.
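    The Grad-CAM weighting described above (each channel's activations multiplied by the spatial average of that channel's gradients) can be sketched in plain Python. The final ReLU follows the standard Grad-CAM formulation rather than anything stated in the abstract, the combination with the first-layer activation map is omitted, and the function names and toy arrays are assumptions.

```python
def grad_cam(activations, gradients):
    """Compute a Grad-CAM heatmap.

    Both arguments are nested lists shaped [channel][row][col]; the values
    here are toy numbers, not outputs of a real network.
    """
    n_rows = len(activations[0])
    n_cols = len(activations[0][0])
    # One weight per channel: the spatial average of that channel's gradients.
    weights = [
        sum(sum(row) for row in grad) / (n_rows * n_cols)
        for grad in gradients
    ]
    heatmap = [[0.0] * n_cols for _ in range(n_rows)]
    for w, act in zip(weights, activations):
        for i in range(n_rows):
            for j in range(n_cols):
                heatmap[i][j] += w * act[i][j]
    # ReLU keeps only the regions that increase the class score.
    return [[max(0.0, v) for v in row] for row in heatmap]


def peak_location(heatmap):
    """Return the (row, col) of the largest heatmap value."""
    cells = [(i, j) for i in range(len(heatmap)) for j in range(len(heatmap[0]))]
    return max(cells, key=lambda ij: heatmap[ij[0]][ij[1]])


acts = [[[1.0, 0.0], [0.0, 2.0]], [[0.0, 3.0], [1.0, 0.0]]]
grads = [[[1.0, 1.0], [1.0, 1.0]], [[-1.0, -1.0], [-1.0, -1.0]]]
cam = grad_cam(acts, grads)
print(cam, peak_location(cam))
```

    In this toy case the second channel has negative average gradient, so its activations are suppressed and the peak lands where the first channel is strongest.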
    CEIL: Generalized Contextual Imitation Learning. (arXiv:2306.14534v2 [cs.LG] UPDATED)
    In this paper, we present ContExtual Imitation Learning (CEIL), a general and broadly applicable algorithm for imitation learning (IL). Inspired by the formulation of hindsight information matching, we derive CEIL by explicitly learning a hindsight embedding function together with a contextual policy using the hindsight embeddings. To achieve the expert matching objective for IL, we advocate for optimizing a contextual variable such that it biases the contextual policy towards mimicking expert behaviors. Beyond the typical learning from demonstrations (LfD) setting, CEIL is a generalist that can be effectively applied to multiple settings including: 1) learning from observations (LfO), 2) offline IL, 3) cross-domain IL (mismatched experts), and 4) one-shot IL settings. Empirically, we evaluate CEIL on the popular MuJoCo tasks (online) and the D4RL dataset (offline). Compared to prior state-of-the-art baselines, we show that CEIL is more sample-efficient in most online IL tasks and achieves better or competitive performances in offline tasks.
    Detecting and Mitigating Hallucinations in Multilingual Summarisation. (arXiv:2305.13632v2 [cs.CL] UPDATED)
    Hallucinations pose a significant challenge to the reliability of neural models for abstractive summarisation. While automatically generated summaries may be fluent, they often lack faithfulness to the original document. This issue becomes even more pronounced in low-resource settings, such as cross-lingual transfer. With the existing faithfulness metrics focusing on English, even measuring the extent of this phenomenon in cross-lingual settings is hard. To address this, we first develop a novel metric, mFACT, evaluating the faithfulness of non-English summaries, leveraging translation-based transfer from multiple English faithfulness metrics. We then propose a simple but effective method to reduce hallucinations in cross-lingual transfer, which weighs the loss of each training example by its faithfulness score. Through extensive experiments in multiple languages, we demonstrate that mFACT is the metric that is most suited to detect hallucinations. Moreover, we find that our proposed loss weighting method drastically increases both performance and faithfulness according to both automatic and human evaluation when compared to strong baselines for cross-lingual transfer such as MAD-X. Our code and dataset are available at https://github.com/yfqiu-nlp/mfact-summ.
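    The loss weighting idea (scale each training example's loss by its faithfulness score) can be sketched as below. This is an illustration only: the function name, the assumption that scores lie in [0, 1], and the choice to average over the batch size are not taken from the paper.

```python
def faithfulness_weighted_loss(losses, faithfulness_scores):
    """Average per-example losses after scaling each by its faithfulness score.

    `losses` and `faithfulness_scores` are hypothetical per-example values;
    scores are assumed to lie in [0, 1], with 1 meaning fully faithful.
    """
    if len(losses) != len(faithfulness_scores):
        raise ValueError("one score is required per example")
    weighted = [l * s for l, s in zip(losses, faithfulness_scores)]
    return sum(weighted) / len(weighted)


# The second example is judged half as faithful, so it contributes half as much.
print(faithfulness_weighted_loss([2.0, 4.0], [1.0, 0.5]))
```

    Under this scheme, examples whose reference summaries are judged unfaithful pull less on the gradient, so the model is trained less on hallucinated targets.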
    Revisiting Out-of-distribution Robustness in NLP: Benchmark, Analysis, and LLMs Evaluations. (arXiv:2306.04618v2 [cs.CL] UPDATED)
    This paper reexamines the research on out-of-distribution (OOD) robustness in the field of NLP. We find that the distribution shift settings in previous studies commonly lack adequate challenges, hindering the accurate evaluation of OOD robustness. To address these issues, we propose a benchmark construction protocol that ensures clear differentiation and challenging distribution shifts. Then we introduce BOSS, a Benchmark suite for Out-of-distribution robustneSS evaluation covering 5 tasks and 20 datasets. Based on BOSS, we conduct a series of experiments on pre-trained language models for analysis and evaluation of OOD robustness. First, for vanilla fine-tuning, we examine the relationship between in-distribution (ID) and OOD performance. We identify three typical types that unveil the inner learning mechanism, which could potentially facilitate the forecasting of OOD robustness, correlating with the advancements on ID datasets. Then, we evaluate 5 classic methods on BOSS and find that, despite exhibiting some effectiveness in specific cases, they do not offer significant improvement compared to vanilla fine-tuning. Further, we evaluate 5 LLMs with various adaptation paradigms and find that when sufficient ID data is available, fine-tuned domain-specific models significantly outperform LLMs on ID examples. However, in the case of OOD instances, prioritizing LLMs with in-context learning yields better results. We identify that both fine-tuned small models and LLMs face challenges in effectively addressing downstream tasks. The code is public at https://github.com/lifan-yuan/OOD_NLP.
    Trajectory Alignment: Understanding the Edge of Stability Phenomenon via Bifurcation Theory. (arXiv:2307.04204v2 [cs.LG] UPDATED)
    Cohen et al. (2021) empirically study the evolution of the largest eigenvalue of the loss Hessian, also known as sharpness, along the gradient descent (GD) trajectory and observe the Edge of Stability (EoS) phenomenon. The sharpness increases at the early phase of training (referred to as progressive sharpening), and eventually saturates close to the threshold of $2 / \text{(step size)}$. In this paper, we start by demonstrating through empirical studies that when the EoS phenomenon occurs, different GD trajectories (after a proper reparameterization) align on a specific bifurcation diagram independent of initialization. We then rigorously prove this trajectory alignment phenomenon for a two-layer fully-connected linear network and a single-neuron nonlinear network trained with a single data point. Our trajectory alignment analysis establishes both progressive sharpening and EoS phenomena, encompassing and extending recent findings in the literature.
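    The stability threshold $2 / \text{(step size)}$ can be checked on the simplest possible case, a one-dimensional quadratic whose second derivative plays the role of sharpness. This is a textbook illustration, not the paper's analysis; the function name and the chosen numbers are assumptions.

```python
def gd_on_quadratic(sharpness, step_size, x0=1.0, steps=50):
    """Run gradient descent on f(x) = 0.5 * sharpness * x**2.

    Returns |x| after `steps` updates. Each update multiplies x by
    (1 - step_size * sharpness), so iterates shrink only when
    sharpness < 2 / step_size.
    """
    x = x0
    for _ in range(steps):
        x -= step_size * sharpness * x  # the gradient of f is sharpness * x
    return abs(x)


step_size = 0.1  # stability threshold is 2 / 0.1 = 20
print(gd_on_quadratic(19.0, step_size))  # just below the threshold: shrinks
print(gd_on_quadratic(21.0, step_size))  # just above the threshold: grows
```

    On a fixed quadratic, crossing the threshold simply diverges; the EoS observation is that in neural network training the sharpness instead hovers near this value.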
    Distribution-Free Model-Agnostic Regression Calibration via Nonparametric Methods. (arXiv:2305.12283v2 [cs.LG] UPDATED)
    In this paper, we consider the uncertainty quantification problem for regression models. Specifically, we consider an individual calibration objective for characterizing the quantiles of the prediction model. While such an objective is well-motivated from downstream tasks such as newsvendor cost, the existing methods have been largely heuristic and lack statistical guarantees in terms of individual calibration. We show via simple examples that the existing methods focusing on population-level calibration guarantees such as average calibration or sharpness can lead to harmful and unexpected results. We propose simple nonparametric calibration methods that are agnostic of the underlying prediction model and enjoy both computational efficiency and statistical consistency. Our approach enables a better understanding of the possibility of individual calibration, and we establish matching upper and lower bounds for the calibration error of our proposed methods. Technically, our analysis combines the nonparametric analysis with a covering number argument for parametric analysis, which advances the existing theoretical analyses in the literature of nonparametric density estimation and quantile bandit problems. Importantly, the nonparametric perspective offers new theoretical insights into regression calibration in terms of the curse of dimensionality and reconciles the existing results on the impossibility of individual calibration. To our knowledge, we make the first effort to reach both individual calibration and finite-sample guarantee with minimal assumptions in terms of conformal prediction. Numerical experiments show the advantage of such a simple approach under various metrics, and also under covariate shift. We hope our work provides a simple benchmark and a starting point of theoretical ground for future research on regression calibration.
    Neural (Tangent Kernel) Collapse. (arXiv:2305.16427v2 [cs.LG] UPDATED)
    This work bridges two important concepts: the Neural Tangent Kernel (NTK), which captures the evolution of deep neural networks (DNNs) during training, and the Neural Collapse (NC) phenomenon, which refers to the emergence of symmetry and structure in the last-layer features of well-trained classification DNNs. We adopt the natural assumption that the empirical NTK develops a block structure aligned with the class labels, i.e., samples within the same class have stronger correlations than samples from different classes. Under this assumption, we derive the dynamics of DNNs trained with mean squared error (MSE) loss and break them into interpretable phases. Moreover, we identify an invariant that captures the essence of the dynamics, and use it to prove the emergence of NC in DNNs with block-structured NTK. We provide large-scale numerical experiments on three common DNN architectures and three benchmark datasets to support our theory.
    A Batch-to-Online Transformation under Random-Order Model. (arXiv:2306.07163v2 [cs.LG] UPDATED)
    We introduce a transformation framework that can be utilized to develop online algorithms with low $\epsilon$-approximate regret in the random-order model from offline approximation algorithms. We first give a general reduction theorem that transforms an offline approximation algorithm with low average sensitivity to an online algorithm with low $\epsilon$-approximate regret. We then demonstrate that offline approximation algorithms can be transformed into a low-sensitivity version using a coreset construction method. To showcase the versatility of our approach, we apply it to various problems, including online $(k,z)$-clustering, online matrix approximation, and online regression, and successfully achieve polylogarithmic $\epsilon$-approximate regret for each problem. Moreover, we show that in all three cases, our algorithm also enjoys low inconsistency, which may be desired in some online applications.
    A Comprehensive Study of Groundbreaking Machine Learning Research: Analyzing highly cited and impactful publications across six decades. (arXiv:2308.00855v2 [cs.DL] UPDATED)
    Machine learning (ML) has emerged as a prominent field of research in computer science and other related fields, thereby driving advancements in other domains of interest. As the field continues to evolve, it is crucial to understand the landscape of highly cited publications to identify key trends, influential authors, and significant contributions made thus far. In this paper, we present a comprehensive bibliometric analysis of highly cited ML publications. We collected a dataset consisting of the top-cited papers from reputable ML conferences and journals, covering the period from 1959 to 2022. We employed various bibliometric techniques to analyze the data, including citation analysis, co-authorship analysis, keyword analysis, and publication trends. Our findings reveal the most influential papers, highly cited authors, and collaborative networks within the machine learning community. We identify popular research themes and uncover emerging topics that have recently gained significant attention. Furthermore, we examine the geographical distribution of highly cited publications, highlighting the dominance of certain countries in ML research. By shedding light on the landscape of highly cited ML publications, our study provides valuable insights for researchers, policymakers, and practitioners seeking to understand the key developments and trends in this rapidly evolving field.
    On Performance Discrepancies Across Local Homophily Levels in Graph Neural Networks. (arXiv:2306.05557v3 [cs.SI] UPDATED)
    Graph Neural Network (GNN) research has highlighted a relationship between high homophily (i.e., the tendency of nodes of the same class to connect) and strong predictive performance in node classification. However, recent work has found the relationship to be more nuanced, demonstrating that simple GNNs can learn in certain heterophilous settings. To resolve these conflicting findings and align closer to real-world datasets, we go beyond the assumption of a global graph homophily level and study the performance of GNNs when the local homophily level of a node deviates from the global homophily level. Through theoretical and empirical analysis, we systematically demonstrate how shifts in local homophily can introduce performance degradation, leading to performance discrepancies across local homophily levels. We ground the practical implications of this work through granular analysis on five real-world datasets with varying global homophily levels, demonstrating that (a) GNNs can fail to generalize to test nodes that deviate from the global homophily of a graph, and (b) high local homophily does not necessarily confer high performance for a node. We further show that GNNs designed for globally heterophilous graphs can alleviate performance discrepancy by improving performance across local homophily levels, offering a new perspective on how these GNNs achieve stronger global performance.
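    The gap between local and global homophily can be made concrete with a small sketch. It uses the common edge-based definitions (fraction of a node's neighbours, or of all edges, that share a class), which may differ in detail from the paper's; the function names and the toy graph are assumptions.

```python
def local_homophily(adjacency, labels):
    """Map each node to the fraction of its neighbours sharing its label.

    `adjacency` is a dict node -> list of neighbours (undirected edges
    listed in both directions); isolated nodes are skipped.
    """
    result = {}
    for node, neighbours in adjacency.items():
        if not neighbours:
            continue
        same = sum(labels[n] == labels[node] for n in neighbours)
        result[node] = same / len(neighbours)
    return result


def global_homophily(adjacency, labels):
    """Fraction of edge endpoints that connect same-label nodes."""
    total = sum(len(nb) for nb in adjacency.values())
    same = sum(labels[n] == labels[u] for u, nb in adjacency.items() for n in nb)
    return same / total


# Toy graph: a triangle of class "a" nodes with one class "b" node attached.
adjacency = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
labels = {0: "a", 1: "a", 2: "a", 3: "b"}
print(global_homophily(adjacency, labels))
print(local_homophily(adjacency, labels))
```

    Here the graph is globally homophilous, yet node 3 has local homophily zero, which is the kind of node whose deviation from the global level the abstract is concerned with.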
    Training-free Diffusion Model Adaptation for Variable-Sized Text-to-Image Synthesis. (arXiv:2306.08645v2 [cs.CV] UPDATED)
    Diffusion models (DMs) have recently gained attention with state-of-the-art performance in text-to-image synthesis. Abiding by the tradition in deep learning, DMs are trained and evaluated on images with fixed sizes. However, users demand images of specific sizes and various aspect ratios. This paper focuses on adapting text-to-image diffusion models to handle such variety while maintaining visual fidelity. First we observe that, during the synthesis, lower resolution images suffer from incomplete object portrayal, while higher resolution images exhibit repetitively disordered presentation. Next, we establish a statistical relationship indicating that attention entropy changes with token quantity, suggesting that models aggregate spatial information in proportion to image resolution. Our interpretation of these observations is that objects are incompletely depicted due to limited spatial information for low resolutions, while repetitively disorganized presentation arises from redundant spatial information for high resolutions. From this perspective, we propose a scaling factor to alleviate the change of attention entropy and mitigate the defective pattern observed. Extensive experimental results validate the efficacy of the proposed scaling factor, enabling models to achieve better visual effects, image quality, and text alignment. Notably, these improvements are achieved without additional training or fine-tuning techniques.
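    The claim that attention entropy changes with token quantity can be illustrated numerically: the entropy of a softmax over more tokens tends to be larger. The sketch below uses random Gaussian logits as a stand-in for real attention scores and does not implement the paper's scaling factor; all names and constants are assumptions.

```python
import math
import random


def attention_entropy(logits):
    """Shannon entropy (in nats) of softmax(logits)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_entropy(num_tokens, trials=20, seed=0):
    """Average entropy over random logit vectors of length `num_tokens`."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        logits = [rng.gauss(0.0, 1.0) for _ in range(num_tokens)]
        total += attention_entropy(logits)
    return total / trials


# More tokens (a higher-resolution image) gives higher attention entropy.
for n in (64, 256, 1024):
    print(n, round(mean_entropy(n), 3))
```

    For perfectly uniform attention the entropy is exactly the logarithm of the token count, so a model evaluated at a token count different from its training resolution sees a shifted entropy.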
    Improving Multimodal Datasets with Image Captioning. (arXiv:2307.10350v2 [cs.LG] UPDATED)
    Massive web datasets play a key role in the success of large vision-language models like CLIP and Flamingo. However, the raw web data is noisy, and existing filtering methods to reduce noise often come at the expense of data diversity. Our work focuses on caption quality as one major source of noise, and studies how generated captions can increase the utility of web-scraped datapoints with nondescript text. Through exploring different mixing strategies for raw and generated captions, we outperform the best filtering method proposed by the DataComp benchmark by 2% on ImageNet and 4% on average across 38 tasks, given a candidate pool of 128M image-text pairs. Our best approach is also 2x better at Flickr and MS-COCO retrieval. We then analyze what makes synthetic captions an effective source of text supervision. In experimenting with different image captioning models, we also demonstrate that the performance of a model on standard image captioning benchmarks (e.g., NoCaps CIDEr) is not a reliable indicator of the utility of the captions it generates for multimodal training. Finally, our experiments with using generated captions at DataComp's large scale (1.28B image-text pairs) offer insights into the limitations of synthetic text, as well as the importance of image curation with increasing training data quantity. The synthetic captions used in our experiments are now available on HuggingFace.
    Small Total-Cost Constraints in Contextual Bandits with Knapsacks, with Application to Fairness. (arXiv:2305.15807v2 [stat.ML] UPDATED)
    We consider contextual bandit problems with knapsacks [CBwK], a problem where at each round, a scalar reward is obtained and vector-valued costs are suffered. The learner aims to maximize the cumulative rewards while ensuring that the cumulative costs are lower than some predetermined cost constraints. We assume that contexts come from a continuous set, that costs can be signed, and that the expected reward and cost functions, while unknown, may be uniformly estimated -- a typical assumption in the literature. In this setting, total-cost constraints so far had to be at least of order $T^{3/4}$, where $T$ is the number of rounds, and were even typically assumed to depend linearly on $T$. We are, however, motivated to use CBwK to impose a fairness constraint of equalized average costs between groups: the budget associated with the corresponding cost constraints should be as close as possible to the natural deviations, of order $\sqrt{T}$. To that end, we introduce a dual strategy based on projected-gradient-descent updates that is able to deal with total-cost constraints of the order of $\sqrt{T}$ up to poly-logarithmic terms. This strategy is more direct and simpler than existing strategies in the literature. It relies on a careful, adaptive tuning of the step size.
    DSAC-T: Distributional Soft Actor-Critic with Three Refinements. (arXiv:2310.05858v3 [cs.LG] UPDATED)
    Reinforcement learning (RL) has proven to be highly effective in tackling complex decision-making and control tasks. However, prevalent model-free RL methods often face severe performance degradation due to the well-known overestimation issue. In response to this problem, we recently introduced an off-policy RL algorithm, called distributional soft actor-critic (DSAC or DSAC-v1), which can effectively improve the value estimation accuracy by learning a continuous Gaussian value distribution. Nonetheless, standard DSAC has its own shortcomings, including occasionally unstable learning processes and the need for task-specific reward scaling, which may hinder its overall performance and adaptability in some special tasks. This paper further introduces three important refinements to standard DSAC in order to address these shortcomings. These refinements consist of critic gradient adjusting, twin value distribution learning, and variance-based target return clipping. The modified RL algorithm is named DSAC with three refinements (DSAC-T or DSAC-v2), and its performance is systematically evaluated on a diverse set of benchmark tasks. Without any task-specific hyperparameter tuning, DSAC-T surpasses many mainstream model-free RL algorithms, including SAC, TD3, DDPG, TRPO, and PPO, in all tested environments. Additionally, DSAC-T, unlike its standard version, ensures a highly stable learning process and delivers similar performance across varying reward scales.
    Driving through the Concept Gridlock: Unraveling Explainability Bottlenecks in Automated Driving. (arXiv:2310.16639v2 [cs.CV] UPDATED)
    Concept bottleneck models have been successfully used for explainable machine learning by encoding information within the model with a set of human-defined concepts. In the context of human-assisted or autonomous driving, explainability models can help user acceptance and understanding of decisions made by the autonomous vehicle, which can be used to rationalize and explain driver or vehicle behavior. We propose a new approach using concept bottlenecks as visual features for control command predictions and explanations of user and vehicle behavior. We learn a human-understandable concept layer that we use to explain sequential driving scenes while learning vehicle control commands. This approach can then be used to determine whether a change in a preferred gap or steering commands from a human (or autonomous vehicle) is led by an external stimulus or change in preferences. We achieve competitive performance to latent visual features while gaining interpretability within our model setup.
    Mind the spikes: Benign overfitting of kernels and neural networks in fixed dimension. (arXiv:2305.14077v2 [stat.ML] UPDATED)
    The success of over-parameterized neural networks trained to near-zero training error has caused great interest in the phenomenon of benign overfitting, where estimators are statistically consistent even though they interpolate noisy training data. While benign overfitting in fixed dimension has been established for some learning methods, current literature suggests that for regression with typical kernel methods and wide neural networks, benign overfitting requires a high-dimensional setting where the dimension grows with the sample size. In this paper, we show that the smoothness of the estimators, and not the dimension, is the key: benign overfitting is possible if and only if the estimator's derivatives are large enough. We generalize existing inconsistency results to non-interpolating models and more kernels to show that benign overfitting with moderate derivatives is impossible in fixed dimension. Conversely, we show that rate-optimal benign overfitting is possible for regression with a sequence of spiky-smooth kernels with large derivatives. Using neural tangent kernels, we translate our results to wide neural networks. We prove that while infinite-width networks do not overfit benignly with the ReLU activation, this can be fixed by adding small high-frequency fluctuations to the activation function. Our experiments verify that such neural networks, while overfitting, can indeed generalize well even on low-dimensional data sets.
    SEEDS: Exponential SDE Solvers for Fast High-Quality Sampling from Diffusion Models. (arXiv:2305.14267v2 [cs.LG] UPDATED)
    A potent class of generative models known as Diffusion Probabilistic Models (DPMs) has become prominent. A forward diffusion process gradually adds noise to data, while a model learns to gradually denoise. Sampling from pre-trained DPMs is obtained by solving differential equations (DE) defined by the learnt model, a process which has been shown to be prohibitively slow. Numerous efforts on speeding up this process have consisted of crafting powerful ODE solvers. Despite being quick, such solvers do not usually reach the optimal quality achieved by available slow SDE solvers. Our goal is to propose SDE solvers that reach optimal quality without requiring several hundred or thousands of function evaluations (NFEs). We propose Stochastic Explicit Exponential Derivative-free Solvers (SEEDS), improving and generalizing Exponential Integrator approaches to the stochastic case on several frameworks. After carefully analyzing the formulation of exact solutions of diffusion SDEs, we craft SEEDS to analytically compute the linear part of such solutions. Inspired by the Exponential Time-Differencing method, SEEDS use a novel treatment of the stochastic components of solutions, enabling the analytical computation of their variance, and contain high-order terms that allow them to reach optimal-quality sampling $\sim3$-$5\times$ faster than previous SDE methods. We validate our approach on several image generation benchmarks, showing that SEEDS outperform or are competitive with previous SDE solvers. Contrary to the latter, SEEDS are derivative- and training-free, and we fully prove strong convergence guarantees for them.
    Gaussian Membership Inference Privacy. (arXiv:2306.07273v2 [cs.LG] UPDATED)
    We propose a novel and practical privacy notion called $f$-Membership Inference Privacy ($f$-MIP), which explicitly considers the capabilities of realistic adversaries under the membership inference attack threat model. Consequently, $f$-MIP offers interpretable privacy guarantees and improved utility (e.g., better classification accuracy). In particular, we derive a parametric family of $f$-MIP guarantees that we refer to as $\mu$-Gaussian Membership Inference Privacy ($\mu$-GMIP) by theoretically analyzing likelihood ratio-based membership inference attacks on stochastic gradient descent (SGD). Our analysis highlights that models trained with standard SGD already offer an elementary level of MIP. Additionally, we show how $f$-MIP can be amplified by adding noise to gradient updates. Our analysis further yields an analytical membership inference attack that offers two distinct advantages over previous approaches. First, unlike existing state-of-the-art attacks that require training hundreds of shadow models, our attack does not require any shadow model. Second, our analytical attack enables straightforward auditing of our privacy notion $f$-MIP. Finally, we quantify how various hyperparameters (e.g., batch size, number of model parameters) and specific data characteristics determine an attacker's ability to accurately infer a point's membership in the training set. We demonstrate the effectiveness of our method on models trained on vision and tabular datasets.
    Statistical Component Separation for Targeted Signal Recovery in Noisy Mixtures. (arXiv:2306.15012v2 [stat.ML] UPDATED)
    Separating signals from an additive mixture may be an unnecessarily hard problem when one is only interested in specific properties of a given signal. In this work, we tackle simpler "statistical component separation" problems that focus on recovering a predefined set of statistical descriptors of a target signal from a noisy mixture. Assuming access to samples of the noise process, we investigate a method devised to match the statistics of the solution candidate corrupted by noise samples with those of the observed mixture. We first analyze the behavior of this method using simple examples with analytically tractable calculations. Then, we apply it in an image denoising context employing 1) wavelet-based descriptors and 2) ConvNet-based descriptors on astrophysics and ImageNet data. In the case of 1), we show that our method better recovers the descriptors of the target data than a standard denoising method in most situations. Additionally, despite not being constructed for this purpose, it performs surprisingly well in terms of peak signal-to-noise ratio on full signal reconstruction. In comparison, representation 2) appears less suitable for image denoising. Finally, we extend this method by introducing a diffusive stepwise algorithm which gives a new perspective to the initial method and leads to promising results for image denoising under specific circumstances.
    Adaptive importance sampling for Deep Ritz. (arXiv:2310.17185v1 [cs.LG])
    We introduce an adaptive sampling method for the Deep Ritz method aimed at solving partial differential equations (PDEs). Two deep neural networks are used. One network is employed to approximate the solution of PDEs, while the other one is a deep generative model used to generate new collocation points to refine the training set. The adaptive sampling procedure consists of two main steps. The first step is solving the PDEs using the Deep Ritz method by minimizing an associated variational loss discretized by the collocation points in the training set. The second step involves generating a new training set, which is then used in subsequent computations to further improve the accuracy of the current approximate solution. We treat the integrand in the variational loss as an unnormalized probability density function (PDF) and approximate it using a deep generative model called bounded KRnet. The new samples and their associated PDF values are obtained from the bounded KRnet. With these new samples and their associated PDF values, the variational loss can be approximated more accurately by importance sampling. Compared to the original Deep Ritz method, the proposed adaptive method improves accuracy, especially for problems characterized by low regularity and high dimensionality. We demonstrate the effectiveness of our new method through a series of numerical experiments.
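The importance-sampling step mentioned in the abstract above can be illustrated with a minimal sketch. It estimates an integral as the average of integrand(x) / q(x) over samples x drawn from a proposal density q. Everything here is an assumption made for illustration: the integrand `f`, the hand-written proposals, and all function names are hypothetical, and a simple closed-form density stands in for the bounded KRnet that the paper uses as the proposal.

```python
import random


def importance_sampling_estimate(integrand, sample_proposal, proposal_pdf, n, rng):
    """Estimate the integral of `integrand` as the mean of integrand(x) / q(x), x ~ q."""
    total = 0.0
    for _ in range(n):
        x = sample_proposal(rng)
        total += integrand(x) / proposal_pdf(x)
    return total / n


# Hypothetical integrand: f(x) = 3 x^2 on [0, 1], whose exact integral is 1.
def f(x):
    return 3.0 * x * x


# Rough proposal q(x) = 2x on (0, 1], sampled by inverse CDF (x = sqrt(u)).
def sample_rough(rng):
    return (1.0 - rng.random()) ** 0.5  # 1 - u lies in (0, 1], so x is never 0


def rough_pdf(x):
    return 2.0 * x


# Ideal proposal q(x) = 3 x^2, i.e. the normalized integrand itself.
def sample_ideal(rng):
    return (1.0 - rng.random()) ** (1.0 / 3.0)


def ideal_pdf(x):
    return 3.0 * x * x


rng = random.Random(0)
estimate = importance_sampling_estimate(f, sample_rough, rough_pdf, 20000, rng)
ideal = importance_sampling_estimate(f, sample_ideal, ideal_pdf, 100, rng)
```

With the ideal proposal every sample contributes the same ratio, so the estimator has zero variance. This is the motivation for fitting a generative model to the integrand treated as an unnormalized density: the closer the proposal is to the normalized integrand, the fewer samples are needed.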
    The Wasserstein Believer: Learning Belief Updates for Partially Observable Environments through Reliable Latent Space Models. (arXiv:2303.03284v3 [cs.LG] UPDATED)
    Partially Observable Markov Decision Processes (POMDPs) are used to model environments where the full state cannot be perceived by an agent. As such the agent needs to reason taking into account the past observations and actions. However, simply remembering the full history is generally intractable due to the exponential growth in the history space. Maintaining a probability distribution that models the belief over what the true state is can be used as a sufficient statistic of the history, but its computation requires access to the model of the environment and is often intractable. While SOTA algorithms use Recurrent Neural Networks to compress the observation-action history aiming to learn a sufficient statistic, they lack guarantees of success and can lead to sub-optimal policies. To overcome this, we propose the Wasserstein Belief Updater, an RL algorithm that learns a latent model of the POMDP and an approximation of the belief update. Our approach comes with theoretical guarantees on the quality of our approximation ensuring that our outputted beliefs allow for learning the optimal value function.
    RS-Del: Edit Distance Robustness Certificates for Sequence Classifiers via Randomized Deletion. (arXiv:2302.01757v2 [cs.CR] UPDATED)
    Randomized smoothing is a leading approach for constructing classifiers that are certifiably robust against adversarial examples. Existing work on randomized smoothing has focused on classifiers with continuous inputs, such as images, where $\ell_p$-norm bounded adversaries are commonly studied. However, there has been limited work for classifiers with discrete or variable-size inputs, such as for source code, which require different threat models and smoothing mechanisms. In this work, we adapt randomized smoothing for discrete sequence classifiers to provide certified robustness against edit distance-bounded adversaries. Our proposed smoothing mechanism randomized deletion (RS-Del) applies random deletion edits, which are (perhaps surprisingly) sufficient to confer robustness against adversarial deletion, insertion and substitution edits. Our proof of certification deviates from the established Neyman-Pearson approach, which is intractable in our setting, and is instead organized around longest common subsequences. We present a case study on malware detection--a binary classification problem on byte sequences where classifier evasion is a well-established threat model. When applied to the popular MalConv malware detection model, our smoothing mechanism RS-Del achieves a certified accuracy of 91% at an edit distance radius of 128 bytes.
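The smoothing mechanism described in the abstract above can be sketched as follows. This is a minimal illustration of prediction only, and it does not compute a robustness certificate. It assumes that each sequence element is deleted independently with a fixed probability `p_del`, which the abstract does not specify, and the base classifier is a hypothetical toy rather than MalConv.

```python
import random
from collections import Counter


def random_deletion(seq, p_del, rng):
    """Delete each element independently with probability p_del, keeping order."""
    return [tok for tok in seq if rng.random() >= p_del]


def smoothed_predict(base_classifier, seq, p_del, n_samples, rng):
    """Majority vote of the base classifier over randomly deleted copies of seq."""
    votes = Counter(
        base_classifier(random_deletion(seq, p_del, rng)) for _ in range(n_samples)
    )
    label, count = votes.most_common(1)[0]
    return label, count / n_samples


# Hypothetical toy base classifier: label 1 if more than half the bytes are 0xFF.
def toy_classifier(seq):
    if not seq:
        return 0
    return int(2 * sum(1 for b in seq if b == 0xFF) > len(seq))


rng = random.Random(0)
benign = [0x00] * 200
suspicious = [0xFF] * 200

benign_label, benign_share = smoothed_predict(toy_classifier, benign, 0.5, 100, rng)
suspicious_label, _ = smoothed_predict(toy_classifier, suspicious, 0.5, 100, rng)
```

The vote share returned alongside the label is the quantity a certification procedure would bound. Turning it into an edit distance radius requires the analysis in the paper and is not attempted here.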
    Investigating Topological Order using Recurrent Neural Networks. (arXiv:2303.11207v3 [cond-mat.str-el] UPDATED)
    Recurrent neural networks (RNNs), originally developed for natural language processing, hold great promise for accurately describing strongly correlated quantum many-body systems. Here, we employ 2D RNNs to investigate two prototypical quantum many-body Hamiltonians exhibiting topological order. Specifically, we demonstrate that RNN wave functions can effectively capture the topological order of the toric code and a Bose-Hubbard spin liquid on the kagome lattice by estimating their topological entanglement entropies. We also find that RNNs favor coherent superpositions of minimally-entangled states over minimally-entangled states themselves. Overall, our findings demonstrate that RNN wave functions constitute a powerful tool to study phases of matter beyond Landau's symmetry-breaking paradigm.
    What Makes Data Suitable for a Locally Connected Neural Network? A Necessary and Sufficient Condition Based on Quantum Entanglement. (arXiv:2303.11249v4 [cs.LG] UPDATED)
    The question of what makes a data distribution suitable for deep learning is a fundamental open problem. Focusing on locally connected neural networks (a prevalent family of architectures that includes convolutional and recurrent neural networks as well as local self-attention models), we address this problem by adopting theoretical tools from quantum physics. Our main theoretical result states that a certain locally connected neural network is capable of accurate prediction over a data distribution if and only if the data distribution admits low quantum entanglement under certain canonical partitions of features. As a practical application of this result, we derive a preprocessing method for enhancing the suitability of a data distribution to locally connected neural networks. Experiments with widespread models over various datasets demonstrate our findings. We hope that our use of quantum entanglement will encourage further adoption of tools from physics for formally reasoning about the relation between deep learning and real-world data.
    Self-Evaluation Guided Beam Search for Reasoning. (arXiv:2305.00633v3 [cs.CL] UPDATED)
    Breaking down a problem into intermediate steps has demonstrated impressive performance in Large Language Model (LLM) reasoning. However, the growth of the reasoning chain introduces uncertainty and error accumulation, making it challenging to elicit accurate final results. To tackle this challenge of uncertainty in multi-step reasoning, we introduce a stepwise self-evaluation mechanism to guide and calibrate the reasoning process of LLMs. We propose a decoding algorithm integrating the self-evaluation guidance via stochastic beam search. The self-evaluation guidance serves as a better-calibrated automatic criterion, facilitating an efficient search in the reasoning space and resulting in superior prediction quality. Stochastic beam search balances exploitation and exploration of the search space with temperature-controlled randomness. Our approach surpasses the corresponding Codex-backboned baselines in few-shot accuracy by $6.34\%$, $9.56\%$, and $5.46\%$ on the GSM8K, AQuA, and StrategyQA benchmarks, respectively. Experiment results with Llama-2 on arithmetic reasoning demonstrate the efficiency of our method in outperforming the baseline methods with comparable computational budgets. Further analysis in multi-step reasoning finds our self-evaluation guidance pinpoints logic failures and leads to higher consistency and robustness. Our code is publicly available at https://guideddecoding.github.io/.
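The temperature-controlled selection step of a stochastic beam search can be sketched as below. This is not the authors' decoding algorithm: it swaps in the Gumbel-top-k trick to sample beams without replacement with probability proportional to exp(score / temperature). The `expand` and `score` functions are hypothetical toys, where `score` stands in for a combination of language-model likelihood and self-evaluation confidence.

```python
import math
import random


def gumbel_top_k(scores, k, temperature, rng):
    """Sample k distinct indices, favouring high scores, via Gumbel-top-k."""
    perturbed = []
    for i, s in enumerate(scores):
        u = max(rng.random(), 1e-300)  # guard against log(0)
        gumbel = -math.log(-math.log(u))
        perturbed.append((s / temperature + gumbel, i))
    perturbed.sort(reverse=True)
    return [i for _, i in perturbed[:k]]


def stochastic_beam_step(beams, expand, score, beam_width, temperature, rng):
    """Extend every beam by one step, then sample the beams that survive."""
    candidates = [beam + [step] for beam in beams for step in expand(beam)]
    scores = [score(c) for c in candidates]
    k = min(beam_width, len(candidates))
    return [candidates[i] for i in gumbel_top_k(scores, k, temperature, rng)]


# Hypothetical toy problem: three possible reasoning steps, "a" is the good one.
def expand(beam):
    return ["a", "b", "c"]


def score(chain):
    return float(chain.count("a"))


rng = random.Random(0)

# Low temperature: selection is nearly greedy (exploitation).
beams_cold = [[]]
for _ in range(2):
    beams_cold = stochastic_beam_step(beams_cold, expand, score, 2, 1e-6, rng)

# High temperature: selection is close to uniform (exploration).
beams_hot = [[]]
for _ in range(2):
    beams_hot = stochastic_beam_step(beams_hot, expand, score, 2, 100.0, rng)
```

Lowering the temperature makes the highest-scoring chain dominate the selection, while raising it spreads the beam across lower-scoring chains. This is the exploitation and exploration trade-off the abstract refers to.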
    Improving the Timing Resolution of Positron Emission Tomography Detectors Using Boosted Learning -- A Residual Physics Approach. (arXiv:2302.01681v2 [cs.LG] UPDATED)
    Artificial intelligence (AI) is entering medical imaging, mainly enhancing image reconstruction. Nevertheless, improvements throughout the entire processing, from signal detection to computation, potentially offer significant benefits. This work presents a novel and versatile approach to detector optimization using machine learning (ML) and residual physics. We apply the concept to positron emission tomography (PET), intending to improve the coincidence time resolution (CTR). PET visualizes metabolic processes in the body by detecting photons with scintillation detectors. Improved CTR performance offers the advantage of reducing radioactive dose exposure for patients. Modern PET detectors with sophisticated concepts and read-out topologies represent complex physical and electronic systems requiring dedicated calibration techniques. Traditional methods primarily depend on analytical formulations successfully describing the main detector characteristics. However, when accounting for higher-order effects, additional complexities arise matching theoretical models to experimental reality. Our work addresses this challenge by combining traditional calibration with AI and residual physics, presenting a highly promising approach. We present a residual physics-based strategy using gradient tree boosting and physics-guided data generation. The explainable AI framework SHapley Additive exPlanations (SHAP) was used to identify known physical effects with learned patterns. In addition, the models were tested against basic physical laws. We were able to improve the CTR significantly (more than 20%) for clinically relevant detectors of 19 mm height, reaching CTRs of 185 ps (450-550 keV).
    Recycle-and-Distill: Universal Compression Strategy for Transformer-based Speech SSL Models with Attention Map Reusing and Masking Distillation. (arXiv:2305.11685v2 [eess.AS] UPDATED)
    Transformer-based speech self-supervised learning (SSL) models, such as HuBERT, show surprising performance in various speech processing tasks. However, the huge number of parameters in speech SSL models necessitates compression to a more compact model for wider usage in academia or small companies. In this study, we suggest reusing attention maps across the Transformer layers, so as to remove key and query parameters while retaining the number of layers. Furthermore, we propose a novel masking distillation strategy to improve the student model's speech representation quality. We extend the distillation loss to utilize both masked and unmasked speech frames to fully leverage the teacher model's high-quality representation. Our universal compression strategy yields a student model that achieves a phoneme error rate (PER) of 7.72% and a word error rate (WER) of 9.96% on the SUPERB benchmark.
    Large-Scale Gaussian Processes via Alternating Projection. (arXiv:2310.17137v1 [cs.LG])
    Gaussian process (GP) hyperparameter optimization requires repeatedly solving linear systems with $n \times n$ kernel matrices. To address the prohibitive $\mathcal{O}(n^3)$ time complexity, recent work has employed fast iterative numerical methods, like conjugate gradients (CG). However, as datasets increase in magnitude, the corresponding kernel matrices become increasingly ill-conditioned and still require $\mathcal{O}(n^2)$ space without partitioning. Thus, while CG increases the size of datasets GPs can be trained on, modern datasets reach scales beyond its applicability. In this work, we propose an iterative method which only accesses subblocks of the kernel matrix, effectively enabling \emph{mini-batching}. Our algorithm, based on alternating projection, has $\mathcal{O}(n)$ per-iteration time and space complexity, solving many of the practical challenges of scaling GPs to very large datasets. Theoretically, we prove our method enjoys linear convergence and empirically we demonstrate its robustness to ill-conditioning. On large-scale benchmark datasets up to four million datapoints our approach accelerates training by a factor of 2$\times$ to 27$\times$ compared to CG.
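The idea of solving a kernel linear system while touching only part of the kernel matrix per update can be sketched as below. The abstract does not give the update rule, so this is an assumption: the sketch uses the block-size-one special case, which coincides with Gauss-Seidel iteration on a symmetric positive definite system, and each update reads a single column of the matrix. The kernel, data, and function names are hypothetical.

```python
import math


def rbf_kernel_matrix(xs, lengthscale, noise):
    """Dense RBF kernel matrix with observation noise added on the diagonal."""
    n = len(xs)
    return [
        [
            math.exp(-0.5 * ((xs[i] - xs[j]) / lengthscale) ** 2)
            + (noise if i == j else 0.0)
            for j in range(n)
        ]
        for i in range(n)
    ]


def coordinate_projection_solve(K, b, tol=1e-10, max_sweeps=10000):
    """Solve K w = b by cycling over coordinates (block size one)."""
    n = len(b)
    w = [0.0] * n
    r = list(b)  # running residual b - K w
    for _ in range(max_sweeps):
        for i in range(n):
            delta = r[i] / K[i][i]  # exact solve on the 1x1 sub-block
            w[i] += delta
            for j in range(n):
                r[j] -= K[j][i] * delta  # reads only column i of K
        if max(abs(v) for v in r) < tol:
            break
    return w


# Hypothetical toy data: six 1-D inputs and targets.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
b = [0.5, -1.0, 2.0, 0.0, 1.5, -0.5]
K = rbf_kernel_matrix(xs, lengthscale=1.0, noise=0.1)
w = coordinate_projection_solve(K, b)
```

In a large-scale setting the matrix would not be materialized as it is here. The needed column or sub-block would be computed on demand from the inputs, which is what makes mini-batching possible.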
    Understanding and Addressing the Pitfalls of Bisimulation-based Representations in Offline Reinforcement Learning. (arXiv:2310.17139v1 [cs.LG])
    While bisimulation-based approaches hold promise for learning robust state representations for Reinforcement Learning (RL) tasks, their efficacy in offline RL tasks has not been up to par. In some instances, they have even significantly underperformed alternative methods. We aim to understand why bisimulation methods succeed in online settings, but falter in offline tasks. Our analysis reveals that missing transitions in the dataset are particularly harmful to the bisimulation principle, leading to ineffective estimation. We also shed light on the critical role of reward scaling in bounding the scale of bisimulation measurements and of the value error they induce. Based on these findings, we propose to apply the expectile operator for representation learning to our offline RL setting, which helps to prevent overfitting to incomplete data. Meanwhile, by introducing an appropriate reward scaling strategy, we avoid the risk of feature collapse in representation space. We implement these recommendations on two state-of-the-art bisimulation-based algorithms, MICo and SimSR, and demonstrate performance gains on two benchmark suites: D4RL and Visual D4RL. Code is provided at https://github.com/zanghyu/Offline_Bisimulation.
    Beyond MLE: Convex Learning for Text Generation. (arXiv:2310.17217v1 [cs.CL])
    Maximum likelihood estimation (MLE) is a statistical method used to estimate the parameters of a probability distribution that best explain the observed data. In the context of text generation, MLE is often used to train generative language models, which can then be used to generate new text. However, we argue that MLE is not always necessary or optimal, especially for closed-ended text generation tasks like machine translation. In these tasks, the goal of the model is to generate the most appropriate response, which does not necessarily require it to estimate the entire data distribution with MLE. To this end, we propose a novel class of training objectives based on convex functions, which enables text generation models to focus on highly probable outputs without having to estimate the entire data distribution. We investigate the theoretical properties of the optimal predicted distribution when applying convex functions to the loss, demonstrating that convex functions can sharpen the optimal distribution, thereby enabling the model to better capture outputs with high probabilities. Experiments on various text generation tasks and models show the effectiveness of our approach. It enables autoregressive models to bridge the gap between greedy and beam search, and facilitates the learning of non-autoregressive models with a maximum improvement of 9+ BLEU points. Moreover, our approach also exhibits significant impact on large language models (LLMs), substantially enhancing their generative capability on various tasks. Source code is available at https://github.com/ictnlp/Convex-Learning.
    Fairness and bias correction in machine learning for depression prediction: results from four study populations. (arXiv:2211.05321v3 [cs.LG] UPDATED)
    A significant level of stigma and inequality exists in mental healthcare, especially in under-served populations. Inequalities are reflected in the data collected for scientific purposes. When not properly accounted for, machine learning (ML) models learnt from data can reinforce these structural inequalities or biases. Here, we present a systematic study of bias in ML models designed to predict depression in four different case studies covering different countries and populations. We find that standard ML approaches regularly show biased behaviors. We also show that mitigation techniques, both standard and our own post-hoc method, can be effective in reducing the level of unfair bias. No single best ML model for depression prediction provides equality of outcomes. This emphasizes the importance of analyzing fairness during model selection and transparent reporting about the impact of debiasing interventions. Finally, we provide practical recommendations to develop bias-aware ML models for depression risk prediction.
    Learning Rate Free Bayesian Inference in Constrained Domains. (arXiv:2305.14943v2 [stat.ML] UPDATED)
    We introduce a suite of new particle-based algorithms for sampling on constrained domains which are entirely learning rate free. Our approach leverages coin betting ideas from convex optimisation, and the viewpoint of constrained sampling as a mirrored optimisation problem on the space of probability measures. Based on this viewpoint, we also introduce a unifying framework for several existing constrained sampling algorithms, including mirrored Langevin dynamics and mirrored Stein variational gradient descent. We demonstrate the performance of our algorithms on a range of numerical examples, including sampling from targets on the simplex, sampling with fairness constraints, and constrained sampling problems in post-selection inference. Our results indicate that our algorithms achieve competitive performance with existing constrained sampling methods, without the need to tune any hyperparameters.
    Simulation based stacking. (arXiv:2310.17009v1 [stat.ME])
    Simulation-based inference has been popular for amortized Bayesian computation. It is typical to have more than one posterior approximation, from different inference algorithms, different architectures, or simply the randomness of initialization and stochastic gradients. With a provable asymptotic guarantee, we present a general stacking framework to make use of all available posterior approximations. Our stacking method is able to combine densities, simulation draws, confidence intervals, and moments, and address the overall precision, calibration, coverage, and bias at the same time. We illustrate our method on several benchmark simulations and a challenging cosmological inference task.
    Automating lichen monitoring in ecological studies using instance segmentation of time-lapse images. (arXiv:2310.17080v1 [cs.CV])
    Lichens are symbiotic organisms composed of fungi, algae, and/or cyanobacteria that thrive in a variety of environments. They play important roles in carbon and nitrogen cycling, and contribute directly and indirectly to biodiversity. Ecologists typically monitor lichens by using them as indicators to assess air quality and habitat conditions. In particular, epiphytic lichens, which live on trees, are key markers of air quality and environmental health. A new method of monitoring epiphytic lichens involves using time-lapse cameras to gather images of lichen populations. These cameras are used by ecologists in Newfoundland and Labrador to subsequently analyze and manually segment the images to determine lichen thalli condition and change. These methods are time-consuming and susceptible to observer bias. In this work, we aim to automate the monitoring of lichens over extended periods and to estimate their biomass and condition to facilitate the task of ecologists. To accomplish this, our proposed framework uses semantic segmentation with an effective training approach to automate monitoring and biomass estimation of epiphytic lichens on time-lapse images. We show that our method has the potential to significantly improve the accuracy and efficiency of lichen population monitoring, making it a valuable tool for forest ecologists and environmental scientists to evaluate the impact of climate change on Canada's forests. To the best of our knowledge, this is the first time that such an approach has been used to assist ecologists in monitoring and analyzing epiphytic lichens.
    ZipLM: Inference-Aware Structured Pruning of Language Models. (arXiv:2302.04089v2 [cs.LG] UPDATED)
    The breakthrough performance of large language models (LLMs) comes with major computational footprints and high deployment costs. In this paper, we progress towards resolving this problem by proposing a novel structured compression approach for LLMs, called ZipLM. ZipLM achieves state-of-the-art accuracy-vs-speedup, while matching a set of desired target runtime speedups in any given inference environment. Specifically, given a model, a dataset, an inference environment, as well as a set of speedup targets, ZipLM iteratively identifies and removes components with the worst loss-runtime trade-off. Unlike prior methods that specialize in either the post-training/one-shot or the gradual compression setting, and only for specific families of models such as BERT (encoder) or GPT (decoder), ZipLM produces state-of-the-art compressed models across all these settings. Furthermore, ZipLM achieves superior results for a fraction of the computational cost relative to prior distillation and pruning techniques, making it a cost-effective approach for generating an entire family of smaller, faster, and highly accurate models, guaranteed to meet the desired inference specifications. In particular, ZipLM outperforms all prior BERT-base distillation and pruning techniques, such as CoFi, MiniLM, and TinyBERT. Moreover, it matches the performance of the heavily optimized MobileBERT model, obtained via extensive architecture search, by simply pruning the baseline BERT-large model. When compressing GPT2, ZipLM outperforms DistilGPT2 while being 60% smaller and 30% faster. Our code is available at: https://github.com/IST-DASLab/ZipLM.
    Good regularity creates large learning rate implicit biases: edge of stability, balancing, and catapult. (arXiv:2310.17087v1 [cs.LG])
    Large learning rates, when applied to gradient descent for nonconvex optimization, yield various implicit biases including the edge of stability (Cohen et al., 2021), balancing (Wang et al., 2022), and catapult (Lewkowycz et al., 2020). These phenomena cannot be well explained by classical optimization theory. Though significant theoretical progress has been made in understanding these implicit biases, it remains unclear for which objective functions they would occur. This paper provides an initial step in answering this question, namely that these implicit biases are in fact various tips of the same iceberg. They occur when the objective function of optimization has some good regularity, which, in combination with a provable preference of large learning rate gradient descent for moving toward flatter regions, results in these nontrivial dynamical phenomena. To establish this result, we develop a new global convergence theory under large learning rates, for a family of nonconvex functions without globally Lipschitz continuous gradient, which was typically assumed in existing convergence analysis. A byproduct is the first non-asymptotic convergence rate bound for large-learning-rate gradient descent optimization of nonconvex functions. We also validate our theory with experiments on neural networks, where different losses, activation functions, and batch normalization all can significantly affect regularity and lead to very different training dynamics.
    Explainable Spatio-Temporal Graph Neural Networks. (arXiv:2310.17149v1 [cs.LG])
    Spatio-temporal graph neural networks (STGNNs) have gained popularity as a powerful tool for effectively modeling spatio-temporal dependencies in diverse real-world urban applications, including intelligent transportation and public safety. However, the black-box nature of STGNNs limits their interpretability, hindering their application in scenarios related to urban resource allocation and policy formulation. To bridge this gap, we propose an Explainable Spatio-Temporal Graph Neural Networks (STExplainer) framework that enhances STGNNs with inherent explainability, enabling them to provide accurate predictions and faithful explanations simultaneously. Our framework integrates a unified spatio-temporal graph attention network with a positional information fusion layer as the STG encoder and decoder, respectively. Furthermore, we propose a structure distillation approach based on the Graph Information Bottleneck (GIB) principle with an explainable objective, which is instantiated by the STG encoder and decoder. Through extensive experiments, we demonstrate that our STExplainer outperforms state-of-the-art baselines in terms of predictive accuracy and explainability metrics (i.e., sparsity and fidelity) on traffic and crime prediction tasks. Furthermore, our model exhibits superior representation ability in alleviating data missing and sparsity issues. The implementation code is available at: https://github.com/HKUDS/STExplainer.
    math-PVS: A Large Language Model Framework to Map Scientific Publications to PVS Theories. (arXiv:2310.17064v1 [cs.AI])
    As artificial intelligence (AI) gains greater adoption in a wide variety of applications, it has immense potential to contribute to mathematical discovery, by guiding conjecture generation, constructing counterexamples, assisting in formalizing mathematics, and discovering connections between different mathematical areas, to name a few. While prior work has leveraged computers for exhaustive mathematical proof search, recent efforts based on large language models (LLMs) aspire to position computing platforms as co-contributors in the mathematical research process. Despite their current limitations in logic and mathematical tasks, there is growing interest in melding theorem proving systems with foundation models. This work investigates the applicability of LLMs in formalizing advanced mathematical concepts and proposes a framework that can critically review and check mathematical reasoning in research papers. Given the noted reasoning shortcomings of LLMs, our approach synergizes the capabilities of proof assistants, specifically PVS, with LLMs, enabling a bridge between textual descriptions in academic papers and formal specifications in PVS. By harnessing the PVS environment, coupled with data ingestion and conversion mechanisms, we envision an automated process, called \emph{math-PVS}, to extract and formalize mathematical theorems from research papers, offering an innovative tool for academic review and discovery.
    LLM4DyG: Can Large Language Models Solve Problems on Dynamic Graphs?. (arXiv:2310.17110v1 [cs.LG])
    In an era marked by the increasing adoption of Large Language Models (LLMs) for various tasks, there is a growing focus on exploring LLMs' capabilities in handling web data, particularly graph data. Dynamic graphs, which capture temporal network evolution patterns, are ubiquitous in real-world web data. Evaluating LLMs' competence in understanding spatial-temporal information on dynamic graphs is essential for their adoption in web applications, yet it remains unexplored in the literature. In this paper, we bridge this gap by evaluating, to the best of our knowledge for the first time, LLMs' spatial-temporal understanding abilities on dynamic graphs. Specifically, we propose the LLM4DyG benchmark, which includes nine specially designed tasks considering the capability evaluation of LLMs from both temporal and spatial dimensions. Then, we conduct extensive experiments to analyze the impacts of different data generators, data statistics, prompting techniques, and LLMs on the model performance. Finally, we propose Disentangled Spatial-Temporal Thoughts (DST2) for LLMs on dynamic graphs to enhance LLMs' spatial-temporal understanding abilities. Our main observations are: 1) LLMs have preliminary spatial-temporal understanding abilities on dynamic graphs, 2) dynamic graph tasks become increasingly difficult for LLMs as the graph size and density increase, while being insensitive to the time span and data generation mechanism, 3) the proposed DST2 prompting method can help to improve LLMs' spatial-temporal understanding abilities on dynamic graphs for most tasks. The data and code will be open-sourced at publication time.
    Hierarchical Semi-Implicit Variational Inference with Application to Diffusion Model Acceleration. (arXiv:2310.17153v1 [cs.LG])
    Semi-implicit variational inference (SIVI) has been introduced to expand the analytical variational families by defining expressive semi-implicit distributions in a hierarchical manner. However, the single-layer architecture commonly used in current SIVI methods can be insufficient when the target posterior has complicated structures. In this paper, we propose hierarchical semi-implicit variational inference, called HSIVI, which generalizes SIVI to allow more expressive multi-layer construction of semi-implicit distributions. By introducing auxiliary distributions that interpolate between a simple base distribution and the target distribution, the conditional layers can be trained by progressively matching these auxiliary distributions one layer after another. Moreover, given pre-trained score networks, HSIVI can be used to accelerate the sampling process of diffusion models with the score matching objective. We show that HSIVI significantly enhances the expressiveness of SIVI on several Bayesian inference problems with complicated target distributions. When used for diffusion model acceleration, we show that HSIVI can produce high quality samples comparable to or better than the existing fast diffusion model based samplers with a small number of function evaluations on various datasets.
    Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time. (arXiv:2310.17157v1 [cs.LG])
    Large language models (LLMs) with hundreds of billions of parameters have sparked a new wave of exciting AI applications. However, they are computationally expensive at inference time. Sparsity is a natural approach to reduce this cost, but existing methods either require costly retraining, have to forgo the LLM's in-context learning ability, or do not yield wall-clock time speedup on modern hardware. We hypothesize that contextual sparsity, i.e., small, input-dependent sets of attention heads and MLP parameters that yield approximately the same output as the dense model for a given input, can address these issues. We show that contextual sparsity exists, that it can be accurately predicted, and that we can exploit it to speed up LLM inference in wall-clock time without compromising the LLM's quality or in-context learning ability. Based on these insights, we propose DejaVu, a system that uses a low-cost algorithm to predict contextual sparsity on the fly given inputs to each layer, along with an asynchronous and hardware-aware implementation that speeds up LLM inference. We validate that DejaVu can reduce the inference latency of OPT-175B by over 2X compared to the state-of-the-art FasterTransformer, and over 6X compared to the widely used Hugging Face implementation, without compromising model quality. The code is available at https://github.com/FMInference/DejaVu.
    Grokking Beyond Neural Networks: An Empirical Exploration with Model Complexity. (arXiv:2310.17247v1 [cs.LG])
    In some settings neural networks exhibit a phenomenon known as grokking, where they achieve perfect or near-perfect accuracy on the validation set long after the same performance has been achieved on the training set. In this paper, we discover that grokking is not limited to neural networks but occurs in other settings such as Gaussian process (GP) classification, GP regression and linear regression. We also uncover a mechanism by which to induce grokking on algorithmic datasets via the addition of dimensions containing spurious information. The presence of the phenomenon in non-neural architectures provides evidence that grokking is not specific to SGD or weight norm regularisation. Instead, grokking may be possible in any setting where solution search is guided by complexity and error. Based on this insight and further trends we see in the training trajectories of a Bayesian neural network (BNN) and GP regression model, we make progress towards a more general theory of grokking. Specifically, we hypothesise that the phenomenon is governed by the accessibility of certain regions in the error and complexity landscapes.
    Bayesian Neural Networks for Geothermal Resource Assessment: Prediction with Uncertainty. (arXiv:2209.15543v3 [physics.geo-ph] UPDATED)
    We consider the application of machine learning to the evaluation of geothermal resource potential. A supervised learning problem is defined where maps of 10 geological and geophysical features within the state of Nevada, USA are used to define geothermal potential across a broad region. We have available a relatively small set of positive training sites (known resources or active power plants) and negative training sites (known drill sites with unsuitable geothermal conditions) and use these to constrain and optimize artificial neural networks for this classification task. The main objective is to predict the geothermal resource potential at unknown sites within a large geographic area where the defining features are known. These predictions could be used to target promising areas for further detailed investigations. We describe the evolution of our work from defining a specific neural network architecture to training and optimization trials. Upon analysis we expose the inevitable problems of model variability and resulting prediction uncertainty. Finally, to address these problems we apply the concept of Bayesian neural networks, a heuristic approach to regularization in network training, and make use of the practical interpretation of the formal uncertainty measures they provide.
    Topic Segmentation of Semi-Structured and Unstructured Conversational Datasets using Language Models. (arXiv:2310.17120v1 [cs.CL])
    Breaking down a document or a conversation into multiple contiguous segments based on its semantic structure is an important and challenging problem in NLP, which can assist many downstream tasks. However, current works on topic segmentation often focus on segmentation of structured texts. In this paper, we comprehensively analyze the generalization capabilities of state-of-the-art topic segmentation models on unstructured texts. We find that: (a) current strategies of pre-training on a large corpus of structured text such as Wiki-727K do not help in transferability to unstructured conversational data; (b) training from scratch with only a relatively small-sized dataset of the target unstructured domain improves the segmentation results by a significant margin. We stress-test our proposed topic segmentation approach by experimenting with multiple loss functions, in order to mitigate the effects of imbalance in unstructured conversational datasets. Our empirical evaluation indicates that the Focal Loss function is a robust alternative to the Cross-Entropy and re-weighted Cross-Entropy loss functions when segmenting unstructured and semi-structured chats.
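The contrast between Cross-Entropy and Focal Loss mentioned above can be illustrated with a minimal sketch. It uses the standard focal loss formula, -alpha * (1 - p)^gamma * log(p); the values gamma = 2.0 and alpha = 1.0 and the two example probabilities are assumptions chosen for illustration, not settings taken from the paper.

```python
import math


def cross_entropy(p_true: float) -> float:
    """Cross-entropy for one example, given the predicted probability of its true class."""
    return -math.log(p_true)


def focal_loss(p_true: float, gamma: float = 2.0, alpha: float = 1.0) -> float:
    """Focal loss: cross-entropy scaled by (1 - p_true) ** gamma.

    gamma and alpha are illustrative defaults (assumptions), not the paper's values.
    """
    return -alpha * (1.0 - p_true) ** gamma * math.log(p_true)


# Hypothetical predictions: an easy example (e.g. a frequent "no boundary" turn)
# and a hard example (e.g. a rare "topic boundary" turn).
easy, hard = 0.95, 0.30

# How much more the hard example weighs than the easy one under each loss.
ratio_ce = cross_entropy(hard) / cross_entropy(easy)
ratio_fl = focal_loss(hard) / focal_loss(easy)
print(f"hard/easy weight under cross-entropy: {ratio_ce:.1f}")
print(f"hard/easy weight under focal loss:    {ratio_fl:.1f}")
```

With gamma = 0 the focal loss reduces to plain cross-entropy; larger gamma shifts more of the total loss onto hard, typically minority-class, examples, which is why it is a natural candidate for imbalanced segmentation labels.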
    Codebook Features: Sparse and Discrete Interpretability for Neural Networks. (arXiv:2310.17230v1 [cs.LG])
    Understanding neural networks is challenging in part because of the dense, continuous nature of their hidden states. We explore whether we can train neural networks to have hidden states that are sparse, discrete, and more interpretable by quantizing their continuous features into what we call codebook features. Codebook features are produced by finetuning neural networks with vector quantization bottlenecks at each layer, producing a network whose hidden features are the sum of a small number of discrete vector codes chosen from a larger codebook. Surprisingly, we find that neural networks can operate under this extreme bottleneck with only modest degradation in performance. This sparse, discrete bottleneck also provides an intuitive way of controlling neural network behavior: first, find codes that activate when the desired behavior is present, then activate those same codes during generation to elicit that behavior. We validate our approach by training codebook Transformers on several different datasets. First, we explore a finite state machine dataset with far more hidden states than neurons. In this setting, our approach overcomes the superposition problem by assigning states to distinct codes, and we find that we can make the neural network behave as if it is in a different state by activating the code for that state. Second, we train Transformer language models with up to 410M parameters on two natural language datasets. We identify codes in these models representing diverse, disentangled concepts (ranging from negative emotions to months of the year) and find that we can guide the model to generate different topics by activating the appropriate codes during inference. Overall, codebook features appear to be a promising unit of analysis and control for neural networks and interpretability. Our codebase and models are open-sourced at https://github.com/taufeeque9/codebook-features.
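The bottleneck described above, where a hidden state is replaced by the sum of a few codes chosen from a codebook, can be sketched as follows. Selecting codes by cosine similarity, the tiny 3-dimensional codebook, and the function names are all assumptions made for illustration; this is not the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def quantize(hidden, codebook, k=2):
    """Replace `hidden` by the sum of its k most similar codebook vectors.

    Returns the indices of the chosen codes and the quantized vector.
    Similarity-based top-k selection is an assumption of this sketch.
    """
    scores = [cosine(hidden, code) for code in codebook]
    top = sorted(range(len(codebook)), key=lambda i: scores[i], reverse=True)[:k]
    quantized = [sum(codebook[i][d] for i in top) for d in range(len(hidden))]
    return top, quantized


# Hypothetical codebook with four codes in a 3-dimensional hidden space.
codebook = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [-1.0, 0.0, 0.0],
]
hidden = [0.9, 0.4, -0.1]

codes, q = quantize(hidden, codebook, k=2)
print(codes, q)
```

Because the layer's output is fully determined by the chosen indices, intervening on behavior amounts to overriding `codes` with the indices associated with the desired concept before summing.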
    MaxEnt Loss: Constrained Maximum Entropy for Calibration under Out-of-Distribution Shift. (arXiv:2310.17159v1 [cs.LG])
    We present a new loss function that addresses the out-of-distribution (OOD) calibration problem. While many objective functions have been proposed to effectively calibrate models in-distribution, our findings show that they do not always fare well OOD. Based on the Principle of Maximum Entropy, we incorporate helpful statistical constraints observed during training, delivering better model calibration without sacrificing accuracy. We provide theoretical analysis and show empirically that our method works well in practice, achieving state-of-the-art calibration on both synthetic and real-world benchmarks.
    Improving Neural Additive Models with Bayesian Principles. (arXiv:2305.16905v2 [stat.ML] UPDATED)
    Neural additive models (NAMs) can improve the interpretability of deep neural networks by handling input features in separate additive sub-networks. However, they lack inherent mechanisms that provide calibrated uncertainties and enable selection of relevant features and interactions. Approaching NAMs from a Bayesian perspective, we enhance them in three primary ways, namely by a) providing credible intervals for the individual additive sub-networks; b) estimating the marginal likelihood to perform an implicit selection of features via an empirical Bayes procedure; and c) enabling a ranking of feature pairs as candidates for second-order interaction in fine-tuned models. In particular, we develop Laplace-approximated NAMs (LA-NAMs), which show improved empirical performance on tabular datasets and challenging real-world medical tasks.
    Language-based Action Concept Spaces Improve Video Self-Supervised Learning. (arXiv:2307.10922v3 [cs.CV] UPDATED)
    Recent contrastive language image pre-training has led to learning highly transferable and robust image representations. However, adapting these models to video domains with minimal supervision remains an open problem. We explore a simple step in that direction, using language tied self-supervised learning to adapt an image CLIP model to the video domain. A backbone modified for temporal modeling is trained under self-distillation settings with train objectives operating in an action concept space. Feature vectors of various action concepts extracted from a language encoder using relevant textual prompts construct this space. We introduce two train objectives, concept distillation and concept alignment, that retain generality of original representations while enforcing relations between actions and their attributes. Our approach improves zero-shot and linear probing performance on three action recognition benchmarks.
    Seeing is not Believing: Robust Reinforcement Learning against Spurious Correlation. (arXiv:2307.07907v2 [cs.LG] UPDATED)
    Robustness has been extensively studied in reinforcement learning (RL) to handle various forms of uncertainty such as random perturbations, rare events, and malicious attacks. In this work, we consider one critical type of robustness against spurious correlation, where different portions of the state do not have causality but have correlations induced by unobserved confounders. These spurious correlations are ubiquitous in real-world tasks; for instance, a self-driving car usually observes heavy traffic in the daytime and light traffic at night due to unobservable human activity. A model that learns such useless or even harmful correlation could catastrophically fail when the confounder in the test case deviates from the training one. Although motivated, enabling robustness against spurious correlation poses significant challenges since the uncertainty set, shaped by the unobserved confounder and causal structure, is difficult to characterize and identify. Existing robust algorithms that assume simple and unstructured uncertainty sets are therefore inadequate to address this challenge. To solve this issue, we propose Robust State-Confounded Markov Decision Processes (RSC-MDPs) and theoretically demonstrate its superiority in avoiding learning spurious correlations compared with other robust RL counterparts. We also design an empirical algorithm to learn the robust optimal policy for RSC-MDPs, which outperforms all baselines in eight realistic self-driving and manipulation tasks.
    MathVista: Evaluating Math Reasoning in Visual Contexts with GPT-4V, Bard, and Other Large Multimodal Models. (arXiv:2310.02255v2 [cs.CV] UPDATED)
    Large Language Models (LLMs) and Large Multimodal Models (LMMs) exhibit impressive problem-solving skills in many tasks and domains, but their ability in mathematical reasoning in visual contexts has not been systematically studied. To bridge this gap, we present MathVista, a benchmark designed to combine challenges from diverse mathematical and visual tasks. It consists of 6,141 examples, derived from 28 existing multimodal datasets involving mathematics and 3 newly created datasets (i.e., IQTest, FunctionQA, and PaperQA). Completing these tasks requires fine-grained, deep visual understanding and compositional reasoning, which all state-of-the-art foundation models find challenging. With MathVista, we have conducted a comprehensive, quantitative evaluation of 12 prominent foundation models. The best-performing GPT-4V model achieves an overall accuracy of 49.9%, substantially outperforming Bard, the second-best performer, by 15.1%. Our in-depth analysis reveals that the superiority of GPT-4V is mainly attributed to its enhanced visual perception and mathematical reasoning. However, GPT-4V still falls short of human performance by 10.4%, as it often struggles to understand complex figures and perform rigorous reasoning. This significant gap underscores the critical role that MathVista will play in the development of general-purpose AI agents capable of tackling mathematically intensive and visually rich real-world tasks. We further explore the new ability of self-verification, the application of self-consistency, and the interactive chatbot capabilities of GPT-4V, highlighting its promising potential for future research. The project is available at https://mathvista.github.io/.
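Reading the reported gaps as percentage points (an assumption; the abstract does not state whether they are absolute or relative differences), the implied scores for Bard and for human performance follow directly from the figures above:

```latex
\begin{align*}
\text{Bard}  &= 49.9\% - 15.1\% = 34.8\% \\
\text{Human} &= 49.9\% + 10.4\% = 60.3\%
\end{align*}
```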
    Look Beneath the Surface: Exploiting Fundamental Symmetry for Sample-Efficient Offline RL. (arXiv:2306.04220v4 [cs.LG] UPDATED)
    Offline reinforcement learning (RL) offers an appealing approach to real-world tasks by learning policies from pre-collected datasets without interacting with the environment. However, the performance of existing offline RL algorithms heavily depends on the scale and state-action space coverage of datasets. Real-world data collection is often expensive and uncontrollable, leading to small and narrowly covered datasets and posing significant challenges for practical deployments of offline RL. In this paper, we provide a new insight that leveraging the fundamental symmetry of system dynamics can substantially enhance offline RL performance under small datasets. Specifically, we propose a Time-reversal symmetry (T-symmetry) enforced Dynamics Model (TDM), which establishes consistency between a pair of forward and reverse latent dynamics. TDM provides both well-behaved representations for small datasets and a new reliability measure for OOD samples based on compliance with the T-symmetry. These can be readily used to construct a new offline RL algorithm (TSRL) with less conservative policy constraints and a reliable latent space data augmentation procedure. Based on extensive experiments, we find TSRL achieves great performance on small benchmark datasets with as few as 1% of the original samples, which significantly outperforms the recent offline RL algorithms in terms of data efficiency and generalizability. Code is available at: https://github.com/pcheng2/TSRL
    A framework for benchmarking clustering algorithms. (arXiv:2209.09493v3 [cs.LG] UPDATED)
    The evaluation of clustering algorithms can involve running them on a variety of benchmark problems, and comparing their outputs to the reference, ground-truth groupings provided by experts. Unfortunately, many research papers and graduate theses consider only a small number of datasets. Also, the fact that there can be many equally valid ways to cluster a given problem set is rarely taken into account. In order to overcome these limitations, we have developed a framework whose aim is to introduce a consistent methodology for testing clustering algorithms. Furthermore, we have aggregated, polished, and standardised many clustering benchmark dataset collections referred to across the machine learning and data mining literature, and included new datasets of different dimensionalities, sizes, and cluster types. An interactive datasets explorer, the documentation of the Python API, a description of the ways to interact with the framework from other programming languages such as R or MATLAB, and other details are all provided at .
    COMEDIAN: Self-Supervised Learning and Knowledge Distillation for Action Spotting using Transformers. (arXiv:2309.01270v2 [cs.CV] UPDATED)
    We present COMEDIAN, a novel pipeline to initialize spatiotemporal transformers for action spotting, which involves self-supervised learning and knowledge distillation. Action spotting is a timestamp-level temporal action detection task. Our pipeline consists of three steps, with two initialization stages. First, we perform self-supervised initialization of a spatial transformer using short videos as input. Additionally, we initialize a temporal transformer that enhances the spatial transformer's outputs with global context through knowledge distillation from a pre-computed feature bank aligned with each short video segment. In the final step, we fine-tune the transformers to the action spotting task. The experiments, conducted on the SoccerNet-v2 dataset, demonstrate state-of-the-art performance and validate the effectiveness of COMEDIAN's pretraining paradigm. Our results highlight several advantages of our pretraining pipeline, including improved performance and faster convergence compared to non-pretrained models.
    Evaluating Bias and Fairness in Gender-Neutral Pretrained Vision-and-Language Models. (arXiv:2310.17530v1 [cs.CV])
    Pretrained machine learning models are known to perpetuate and even amplify existing biases in data, which can result in unfair outcomes that ultimately impact user experience. Therefore, it is crucial to understand the mechanisms behind those prejudicial biases to ensure that model performance does not result in discriminatory behaviour toward certain groups or populations. In this work, we define gender bias as our case study. We quantify bias amplification in pretraining and after fine-tuning on three families of vision-and-language models. We investigate the connection, if any, between the two learning stages, and evaluate how bias amplification reflects on model performance. Overall, we find that bias amplification in pretraining and after fine-tuning are independent. We then examine the effect of continued pretraining on gender-neutral data, finding that this reduces group disparities, i.e., promotes fairness, on VQAv2 and retrieval tasks without significantly compromising task performance.
    Bifurcations and loss jumps in RNN training. (arXiv:2310.17561v1 [cs.LG])
    Recurrent neural networks (RNNs) are popular machine learning tools for modeling and forecasting sequential data and for inferring dynamical systems (DS) from observed time series. Concepts from DS theory (DST) have variously been used to further our understanding of both how trained RNNs solve complex tasks and the training process itself. Bifurcations are particularly important phenomena in DS, including RNNs, that refer to topological (qualitative) changes in a system's dynamical behavior as one or more of its parameters are varied. Knowing the bifurcation structure of an RNN thus allows one to deduce many of its computational and dynamical properties, like its sensitivity to parameter variations or its behavior during training. In particular, bifurcations may account for sudden loss jumps observed in RNN training that could severely impede the training process. Here we first mathematically prove for a particular class of ReLU-based RNNs that certain bifurcations are indeed associated with loss gradients tending toward infinity or zero. We then introduce a novel heuristic algorithm for detecting all fixed points and k-cycles in ReLU-based RNNs and their existence and stability regions, hence bifurcation manifolds in parameter space. In contrast to previous numerical algorithms for finding fixed points and common continuation methods, our algorithm provides exact results and returns fixed points and cycles up to high orders with surprisingly good scaling behavior. We exemplify the algorithm on the analysis of the training process of RNNs, and find that the recently introduced technique of generalized teacher forcing completely avoids certain types of bifurcations in training. Thus, besides facilitating the DST analysis of trained RNNs, our algorithm provides a powerful instrument for analyzing the training process itself.
    Time-Conditioned Generative Modeling of Object-Centric Representations for Video Decomposition and Prediction. (arXiv:2301.08951v4 [cs.CV] UPDATED)
    When perceiving the world from multiple viewpoints, humans have the ability to reason about the complete objects in a compositional manner even when an object is completely occluded from certain viewpoints. Meanwhile, humans are able to imagine novel views after observing multiple viewpoints. Recent remarkable advances in multi-view object-centric learning still leave some unresolved problems: 1) The shapes of partially or completely occluded objects cannot be well reconstructed. 2) The novel viewpoint prediction depends on expensive viewpoint annotations rather than implicit rules in view representations. In this paper, we introduce a time-conditioned generative model for videos. To reconstruct the complete shape of an object accurately, we enhance the disentanglement between the latent representations of objects and views, where the latent representations of time-conditioned views are jointly inferred with a Transformer and then are input to a sequential extension of Slot Attention to learn object-centric representations. In addition, Gaussian processes are employed as priors of view latent variables for video generation and novel-view prediction without viewpoint annotations. Experiments on multiple datasets demonstrate that the proposed model can perform object-centric video decomposition, reconstruct the complete shapes of occluded objects, and make novel-view predictions.
    Bounding Box-based Multi-objective Bayesian Optimization of Risk Measures under Input Uncertainty. (arXiv:2301.11588v2 [stat.ML] UPDATED)
    In this study, we propose a novel multi-objective Bayesian optimization (MOBO) method to efficiently identify the Pareto front (PF) defined by risk measures for black-box functions in the presence of input uncertainty (IU). Existing BO methods for Pareto optimization in the presence of IU are risk-specific or lack theoretical guarantees, whereas our proposed method addresses general risk measures and has theoretical guarantees. The basic idea of the proposed method is to assume a Gaussian process (GP) model for the black-box function and to construct high-probability bounding boxes for the risk measures using the GP model. Furthermore, in order to reduce the uncertainty of non-dominated bounding boxes, we propose a method of selecting the next evaluation point using a maximin distance defined by the maximum value of a quasi distance based on bounding boxes. As theoretical analysis, we prove that the algorithm can return an arbitrarily accurate solution in a finite number of iterations with high probability, for various risk measures such as Bayes risk, worst-case risk, and value-at-risk. We also give a theoretical analysis that takes into account approximation errors, because non-negligible approximation errors (e.g., finite approximation of PFs and sampling-based approximation of bounding boxes) exist in practice. Through numerical experiments, we confirm that the proposed method outperforms existing methods not only in the setting with IU but also in the setting of ordinary MOBO.
    Trust, but Verify: Robust Image Segmentation using Deep Learning. (arXiv:2310.16999v1 [cs.CV])
    We describe a method for verifying the output of a deep neural network for medical image segmentation that is robust to several classes of random as well as worst-case perturbations, i.e., adversarial attacks. This method is based on a general approach recently developed by the authors called "Trust, but Verify", wherein an auxiliary verification network produces predictions about certain masked features in the input image using the segmentation as an input. A well-designed auxiliary network will produce high-quality predictions when the input segmentations are accurate, but will produce low-quality predictions when the segmentations are incorrect. Checking the predictions of such a network with the original image allows us to detect bad segmentations. However, to ensure the verification method is truly robust, we need a method for checking the quality of the predictions that does not itself rely on a black-box neural network. Indeed, we show that previous methods for segmentation evaluation that do use deep neural regression networks are vulnerable to false negatives, i.e., they can inaccurately label bad segmentations as good. We describe the design of a verification network that avoids such vulnerability and present results to demonstrate its robustness compared to previous methods.
    The statistical thermodynamics of generative diffusion models. (arXiv:2310.17467v1 [stat.ML])
    Generative diffusion models have achieved spectacular performance in many areas of generative modeling. While the fundamental ideas behind these models come from non-equilibrium physics, in this paper we show that many aspects of these models can be understood using the tools of equilibrium statistical mechanics. Using this reformulation, we show that generative diffusion models undergo second-order phase transitions corresponding to symmetry breaking phenomena. We argue that this leads to a form of instability that lies at the heart of their generative capabilities and that can be described by a set of mean field critical exponents. We conclude by analyzing recent work connecting diffusion models and associative memory networks in view of the thermodynamic formulations.
    Bias in Evaluation Processes: An Optimization-Based Model. (arXiv:2310.17489v1 [cs.CY])
    Biases with respect to socially-salient attributes of individuals have been well documented in evaluation processes used in settings such as admissions and hiring. We view such an evaluation process as a transformation of a distribution of the true utility of an individual for a task to an observed distribution and model it as a solution to a loss minimization problem subject to an information constraint. Our model has two parameters that have been identified as factors leading to biases: the resource-information trade-off parameter in the information constraint and the risk-averseness parameter in the loss function. We characterize the distributions that arise from our model and study the effect of the parameters on the observed distribution. The outputs of our model enrich the class of distributions that can be used to capture variation across groups in the observed evaluations. We empirically validate our model by fitting real-world datasets and use it to study the effect of interventions in a downstream selection task. These results contribute to an understanding of the emergence of bias in evaluation processes and provide tools to guide the deployment of interventions to mitigate biases.
    Transferring a molecular foundation model for polymer property predictions. (arXiv:2310.16958v1 [cs.LG])
    Transformer-based large language models have remarkable potential to accelerate design optimization for applications such as drug development and materials discovery. Self-supervised pretraining of transformer models requires large-scale datasets, which are often sparsely populated in topical areas such as polymer science. State-of-the-art approaches for polymers conduct data augmentation to generate additional samples but unavoidably incur extra computational costs. In contrast, large-scale open-source datasets are available for small molecules and provide a potential solution to data scarcity through transfer learning. In this work, we show that transformers pretrained on small molecules and fine-tuned on polymer properties achieve accuracy comparable to those trained on augmented polymer datasets for a series of benchmark prediction tasks.
    BOOST: Harnessing Black-Box Control to Boost Commonsense in LMs' Generation. (arXiv:2310.17054v1 [cs.CL])
    Large language models (LLMs) such as GPT-3 have demonstrated a strong capability to generate coherent and contextually relevant text. However, amidst their successes, a crucial issue persists: their generated outputs still lack commonsense at times. Moreover, fine-tuning the entire LLM towards more commonsensical outputs is computationally expensive if not infeasible. In this paper, we present a computation-efficient framework that steers a frozen Pre-Trained Language Model (PTLM) towards more commonsensical generation (i.e., producing a plausible output that incorporates a list of concepts in a meaningful way). Specifically, we first construct a reference-free evaluator that assigns a sentence with a commonsensical score by grounding the sentence to a dynamic commonsense knowledge base from four different relational aspects. We then use the scorer as the oracle for commonsense knowledge, and extend the controllable generation method called NADO to train an auxiliary head that guides a fixed PTLM to better satisfy the oracle. We test our framework on a series of GPT-2-, Flan-T5-, and Alpaca-based language models (LMs) on two constrained concept-to-sentence benchmarks. Human evaluation results demonstrate that our method consistently leads to the most commonsensical outputs.
    On the Convergence of CART under Sufficient Impurity Decrease Condition. (arXiv:2310.17114v1 [stat.ML])
    The decision tree is a flexible machine learning model that finds its success in numerous applications. It is usually fitted in a recursively greedy manner using CART. In this paper, we investigate the convergence rate of CART under a regression setting. First, we establish an upper bound on the prediction error of CART under a sufficient impurity decrease (SID) condition \cite{chi2022asymptotic} -- our result improves upon the known result by \cite{chi2022asymptotic} under a similar assumption. Furthermore, we provide examples that demonstrate the error bound cannot be further improved by more than a constant or a logarithmic factor. Second, we introduce a set of easily verifiable sufficient conditions for the SID condition. Specifically, we demonstrate that the SID condition can be satisfied in the case of an additive model, provided that the component functions adhere to a "locally reverse Poincaré inequality". We discuss several well-known function classes in non-parametric estimation to illustrate the practical utility of this concept.
    Harnessing the Power of Choices in Decision Tree Learning. (arXiv:2310.01551v2 [cs.LG] UPDATED)
    We propose a simple generalization of standard and empirically successful decision tree learning algorithms such as ID3, C4.5, and CART. These algorithms, which have been central to machine learning for decades, are greedy in nature: they grow a decision tree by iteratively splitting on the best attribute. Our algorithm, Top-$k$, considers the $k$ best attributes as possible splits instead of just the single best attribute. We demonstrate, theoretically and empirically, the power of this simple generalization. We first prove a greediness hierarchy theorem showing that for every $k \in \mathbb{N}$, Top-$(k+1)$ can be dramatically more powerful than Top-$k$: there are data distributions for which the former achieves accuracy $1-\varepsilon$, whereas the latter only achieves accuracy $\frac{1}{2}+\varepsilon$. We then show, through extensive experiments, that Top-$k$ outperforms the two main approaches to decision tree learning: classic greedy algorithms and more recent "optimal decision tree" algorithms. On one hand, Top-$k$ consistently enjoys significant accuracy gains over greedy algorithms across a wide range of benchmarks. On the other hand, Top-$k$ is markedly more scalable than optimal decision tree algorithms and is able to handle dataset and feature set sizes that remain far beyond the reach of these algorithms.
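The idea of trying the $k$ best attributes at each node, rather than only the single best, can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the authors' algorithm: features and labels are binary, attributes are ranked by weighted Gini impurity, and the function returns only the training error of the best depth-limited tree rather than the tree itself. The toy dataset is hypothetical: the label is `x0 XOR x1`, and `x2` is a noisy copy of the label that looks best to a greedy learner.

```python
from collections import Counter


def errors(labels):
    """Training mistakes made by predicting the majority label (0 for an empty node)."""
    return len(labels) - max(Counter(labels).values()) if labels else 0


def gini_after_split(rows, labels, f):
    """Weighted Gini impurity of the two children after splitting on binary feature f."""
    total = 0.0
    for v in (0, 1):
        part = [y for x, y in zip(rows, labels) if x[f] == v]
        if part:
            p = sum(part) / len(part)
            total += len(part) / len(labels) * 2 * p * (1 - p)
    return total


def top_k_error(rows, labels, depth, k):
    """Lowest training error of a depth-limited tree when each node tries its k best features."""
    if depth == 0 or errors(labels) == 0:
        return errors(labels)
    # Rank features by impurity after the split and keep the k most promising ones.
    ranked = sorted(range(len(rows[0])), key=lambda f: gini_after_split(rows, labels, f))[:k]
    best = errors(labels)
    for f in ranked:
        err = 0
        for v in (0, 1):
            sub = [(x, y) for x, y in zip(rows, labels) if x[f] == v]
            err += top_k_error([x for x, _ in sub], [y for _, y in sub], depth - 1, k)
        best = min(best, err)
    return best


# Hypothetical data: label = x0 XOR x1; x2 agrees with the label on 6 of 8 rows.
rows = [(0, 0, 0), (0, 0, 0), (0, 1, 1), (0, 1, 0),
        (1, 0, 1), (1, 0, 1), (1, 1, 0), (1, 1, 1)]
labels = [x0 ^ x1 for x0, x1, _ in rows]

greedy_error = top_k_error(rows, labels, depth=2, k=1)  # k = 1 is the greedy baseline
top2_error = top_k_error(rows, labels, depth=2, k=2)
print(greedy_error, top2_error)
```

On this data the greedy learner commits to `x2` at the root and cannot recover within depth 2, while the search with k = 2 also tries `x0` and finds the exact XOR tree. The cost is a branching factor of k per level, which is the trade-off between greedy and exhaustive "optimal tree" search that the abstract describes.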
    Privately Aligning Language Models with Reinforcement Learning. (arXiv:2310.16960v1 [cs.LG])
    Positioned between pre-training and user deployment, aligning large language models (LLMs) through reinforcement learning (RL) has emerged as a prevailing strategy for training instruction-following models such as ChatGPT. In this work, we initiate the study of privacy-preserving alignment of LLMs through Differential Privacy (DP) in conjunction with RL. Following the influential work of Ziegler et al. (2020), we study two dominant paradigms: (i) alignment via RL without a human in the loop (e.g., positive review generation) and (ii) alignment via RL from human feedback (RLHF) (e.g., summarization in a human-preferred way). We give a new DP framework to achieve alignment via RL, and prove its correctness. Our experimental results validate the effectiveness of our approach, offering competitive utility while ensuring strong privacy protections.
    Learning to Rank for Active Learning via Multi-Task Bilevel Optimization. (arXiv:2310.17044v1 [cs.LG])
    Active learning is a promising paradigm to reduce the labeling cost by strategically requesting labels to improve model performance. However, existing active learning methods often rely on acquisition functions that are expensive to compute, extensive model retraining, and multiple rounds of interaction with annotators. To address these limitations, we propose a novel approach for active learning, which aims to select batches of unlabeled instances through a learned surrogate model for data acquisition. A key challenge in this approach is developing an acquisition function that generalizes well, as the history of data, which forms part of the utility function's input, grows over time. Our novel algorithmic contribution is a multi-task bilevel optimization framework that predicts the relative utility -- measured by the validation accuracy -- of different training sets, and ensures the learned acquisition function generalizes effectively. For cases where validation accuracy is expensive to evaluate, we introduce efficient interpolation-based surrogate models to estimate the utility function, reducing the evaluation cost. We demonstrate the performance of our approach through extensive experiments on standard active classification benchmarks. By employing our learned utility function, we show significant improvements over traditional techniques, paving the way for more efficient and effective utility maximization in active learning applications.
    Strategizing EV Charging and Renewable Integration in Texas. (arXiv:2310.17056v1 [eess.SY])
    Exploring the convergence of electric vehicles (EVs), renewable energy, and smart grid technologies in the context of Texas, this study addresses challenges hindering the widespread adoption of EVs. Acknowledging their environmental benefits, the research focuses on grid stability concerns, uncoordinated charging patterns, and the complicated relationship between EVs and renewable energy sources. Dynamic time warping (DTW) clustering and k-means clustering methodologies categorize days based on total load and net load, offering nuanced insights into daily electricity consumption and renewable energy generation patterns. By establishing optimal charging and vehicle-to-grid (V2G) windows tailored to specific load characteristics, the study provides a sophisticated methodology for strategic decision-making in energy consumption and renewable integration. The findings contribute to the ongoing discourse on achieving a sustainable and resilient energy future through the seamless integration of EVs into smart grids.
    Isometric Motion Manifold Primitives. (arXiv:2310.17072v1 [cs.AI])
    The Motion Manifold Primitive (MMP) produces, for a given task, a continuous manifold of trajectories each of which can successfully complete the task. It consists of the decoder function that parametrizes the manifold and the probability density in the latent coordinate space. In this paper, we first show that the MMP performance can significantly degrade due to the geometric distortion in the latent space -- by distortion, we mean that similar motions are not located nearby in the latent space. We then propose Isometric Motion Manifold Primitives (IMMP) whose latent coordinate space preserves the geometry of the manifold. For this purpose, we formulate and use a Riemannian metric for the motion space (i.e., parametric curve space), which we call a CurveGeom Riemannian metric. Experiments with planar obstacle-avoiding motions and pushing manipulation tasks show that IMMP significantly outperforms existing MMP methods. Code is available at https://github.com/Gabe-YHLee/IMMP-public.
    Transformers Learn Higher-Order Optimization Methods for In-Context Learning: A Study with Linear Models. (arXiv:2310.17086v1 [cs.LG])
    Transformers are remarkably good at in-context learning (ICL) -- learning from demonstrations without parameter updates -- but how they perform ICL remains a mystery. Recent work suggests that Transformers may learn in-context by internally running Gradient Descent, a first-order optimization method. In this paper, we instead demonstrate that Transformers learn to implement higher-order optimization methods to perform ICL. Focusing on in-context linear regression, we show that Transformers learn to implement an algorithm very similar to Iterative Newton's Method, a higher-order optimization method, rather than Gradient Descent. Empirically, we show that predictions from successive Transformer layers closely match different iterations of Newton's Method linearly, with each middle layer roughly computing 3 iterations. In contrast, exponentially more Gradient Descent steps are needed to match an additional Transformer layer; this suggests that Transformers have a comparable rate of convergence to high-order methods such as Iterative Newton, which are exponentially faster than Gradient Descent. We also show that Transformers can learn in-context on ill-conditioned data, a setting where Gradient Descent struggles but Iterative Newton succeeds. Finally, we show theoretical results which support our empirical findings and have a close correspondence with them: we prove that Transformers can implement $k$ iterations of Newton's method with $\mathcal{O}(k)$ layers.
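    To make the contrast between the two optimizers concrete, here is a minimal NumPy sketch for least-squares regression. It is an illustration under stated assumptions, not the authors' code: the Newton variant is taken to be the standard Newton-Schulz iteration for inverting $S = X^\top X$, and the function names, the initial scaling `alpha`, and the step size `lr` are choices made for this example.

```python
import numpy as np


def newton_inverse_regression(X, y, steps):
    """Least squares via a Newton-Schulz iteration for the inverse of S = X^T X.

    Assumption for this sketch: M_0 = alpha * S with alpha = 1 / ||S||_2^2,
    which keeps the eigenvalues of I - M_0 S in [0, 1) when S has full rank.
    """
    S = X.T @ X
    alpha = 1.0 / np.linalg.norm(S, 2) ** 2
    M = alpha * S
    for _ in range(steps):
        # The residual I - M S is squared at every step (quadratic convergence).
        M = 2.0 * M - M @ S @ M
    return M @ X.T @ y


def gradient_descent_regression(X, y, steps):
    """Plain gradient descent on 0.5 * ||X w - y||^2, starting from w = 0."""
    S = X.T @ X
    lr = 1.0 / np.linalg.norm(S, 2)  # step size 1 / lambda_max(S)
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w = w - lr * (S @ w - X.T @ y)
    return w
```

    Because the Newton residual is squared at each step while the gradient descent error only shrinks by a constant factor, matching a fixed number of Newton steps generally takes many more gradient steps, and the gap widens as the problem becomes ill-conditioned.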
    Efficient Neural Network Approaches for Conditional Optimal Transport with Applications in Bayesian Inference. (arXiv:2310.16975v1 [stat.ML])
    We present two neural network approaches that approximate the solutions of static and dynamic conditional optimal transport (COT) problems, respectively. Both approaches enable sampling and density estimation of conditional probability distributions, which are core tasks in Bayesian inference. Our methods represent the target conditional distributions as transformations of a tractable reference distribution and, therefore, fall into the framework of measure transport. COT maps are a canonical choice within this framework, with desirable properties such as uniqueness and monotonicity. However, the associated COT problems are computationally challenging, even in moderate dimensions. To improve the scalability, our numerical algorithms leverage neural networks to parameterize COT maps. Our methods exploit the structure of the static and dynamic formulations of the COT problem. PCP-Map models conditional transport maps as the gradient of a partially input convex neural network (PICNN) and uses a novel numerical implementation to increase computational efficiency compared to state-of-the-art alternatives. COT-Flow models conditional transports via the flow of a regularized neural ODE; it is slower to train but offers faster sampling. We demonstrate their effectiveness and efficiency by comparing them with state-of-the-art approaches using benchmark datasets and Bayesian inverse problems.
    Reimagining Synthetic Tabular Data Generation through Data-Centric AI: A Comprehensive Benchmark. (arXiv:2310.16981v1 [cs.LG])
    Synthetic data serves as an alternative in training machine learning models, particularly when real-world data is limited or inaccessible. However, ensuring that synthetic data mirrors the complex nuances of real-world data is a challenging task. This paper addresses this issue by exploring the potential of integrating data-centric AI techniques which profile the data to guide the synthetic data generation process. Moreover, we shed light on the often ignored consequences of neglecting these data profiles during synthetic data generation -- despite seemingly high statistical fidelity. Subsequently, we propose a novel framework to evaluate the integration of data profiles to guide the creation of more representative synthetic data. In an empirical study, we evaluate the performance of five state-of-the-art models for tabular data generation on eleven distinct tabular datasets. The findings offer critical insights into the successes and limitations of current synthetic data generation techniques. Finally, we provide practical recommendations for integrating data-centric insights into the synthetic data generation process, with a specific focus on classification performance, model selection, and feature selection. This study aims to reevaluate conventional approaches to synthetic data generation and promote the application of data-centric AI techniques in improving the quality and effectiveness of synthetic data.
    Conditionally Combining Robot Skills using Large Language Models. (arXiv:2310.17019v1 [cs.LG])
    This paper combines two contributions. First, we introduce an extension of the Meta-World benchmark, which we call "Language-World," which allows a large language model to operate in a simulated robotic environment using semi-structured natural language queries and scripted skills described using natural language. By using the same set of tasks as Meta-World, Language-World results can be easily compared to Meta-World results, allowing for a point of comparison between recent methods using Large Language Models (LLMs) and those using Deep Reinforcement Learning. Second, we introduce a method we call Plan Conditioned Behavioral Cloning (PCBC), that allows finetuning the behavior of high-level plans using end-to-end demonstrations. Using Language-World, we show that PCBC is able to achieve strong performance in a variety of few-shot regimes, often achieving task generalization with as little as a single demonstration. We have made Language-World available as open-source software at https://github.com/krzentner/language-world/.
    Streaming Factor Trajectory Learning for Temporal Tensor Decomposition. (arXiv:2310.17021v1 [cs.LG])
    Practical tensor data often comes with time information. Most existing temporal decomposition approaches estimate a set of fixed factors for the objects in each tensor mode, and hence cannot capture the temporal evolution of the objects' representation. More importantly, we lack an effective approach to capture such evolution from streaming data, which is common in real-world applications. To address these issues, we propose Streaming Factor Trajectory Learning (SFTL) for temporal tensor decomposition. We use Gaussian processes (GPs) to model the trajectory of factors so as to flexibly estimate their temporal evolution. To address the computational challenges in handling streaming data, we convert the GPs into a state-space prior by constructing an equivalent stochastic differential equation (SDE). We develop an efficient online filtering algorithm to estimate a decoupled running posterior of the involved factor states upon receiving new data. The decoupled estimation enables us to conduct standard Rauch-Tung-Striebel smoothing to compute the full posterior of all the trajectories in parallel, without the need for revisiting any previous data. We have shown the advantage of SFTL in both synthetic tasks and real-world applications.
    Road Network Guided Fine-Grained Urban Traffic Flow Inference. (arXiv:2109.14251v3 [cs.LG] UPDATED)
    Accurate inference of fine-grained traffic flow from coarse-grained flow is an emerging yet crucial problem, which can help greatly reduce the number of required traffic monitoring sensors for cost savings. In this work, we notice that traffic flow has a high correlation with the road network, which was either completely ignored or simply treated as an external factor in previous works. To address this problem, we propose a novel Road-Aware Traffic Flow Magnifier (RATFM) that explicitly exploits the prior knowledge of road networks to fully learn the road-aware spatial distribution of fine-grained traffic flow. Specifically, a multi-directional 1D convolutional layer is first introduced to extract the semantic feature of the road network. Subsequently, we incorporate the road network feature and coarse-grained flow feature to regularize the short-range spatial distribution modeling of road-relative traffic flow. Furthermore, we take the road network feature as a query to capture the long-range spatial distribution of traffic flow with a transformer architecture. Benefiting from the road-aware inference mechanism, our method can generate high-quality fine-grained traffic flow maps. Extensive experiments on three real-world datasets show that the proposed RATFM outperforms state-of-the-art models under various scenarios. Our code and datasets are released at https://github.com/luimoli/RATFM.
    Label Embedding via Low-Coherence Matrices. (arXiv:2305.19470v3 [cs.LG] UPDATED)
    Label embedding is a framework for multiclass classification problems where each label is represented by a distinct vector of some fixed dimension, and training involves matching model output to the vector representing the correct label. While label embedding has been successfully applied in extreme classification and zero-shot learning, and offers both computational and statistical advantages, its theoretical foundations remain poorly understood. This work presents an analysis of label embedding in the context of extreme multiclass classification, where the number of classes $C$ is very large. We present an excess risk bound that reveals a trade-off between computational and statistical efficiency, quantified via the coherence of the embedding matrix. We further show that under the Massart noise condition, the statistical penalty for label embedding vanishes with sufficiently low coherence. Our analysis supports an algorithm that is simple, scalable, and easily parallelizable, and experimental results demonstrate its effectiveness in large-scale applications.
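    The setup described above can be sketched in a few lines of NumPy. This is a hedged illustration rather than the paper's algorithm: the scaled Rademacher construction, the inner-product decoding rule, and all function names are assumptions introduced for this example.

```python
import numpy as np


def rademacher_embedding(num_classes, dim, seed=0):
    """One unit-norm random +-1/sqrt(dim) column per class (illustrative choice)."""
    rng = np.random.default_rng(seed)
    signs = rng.choice([-1.0, 1.0], size=(dim, num_classes))
    return signs / np.sqrt(dim)


def coherence(G):
    """Largest absolute inner product between two distinct unit-norm columns."""
    gram = G.T @ G
    np.fill_diagonal(gram, 0.0)
    return float(np.max(np.abs(gram)))


def decode(G, z):
    """Predict the class whose embedding has the largest inner product with z."""
    return int(np.argmax(G.T @ z))
```

    A model would be trained to regress onto the column `G[:, label]`, so the output dimension is `dim` rather than the number of classes; lower coherence leaves more margin for decoding a noisy output back to the correct label.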
    Controlled Decoding from Language Models. (arXiv:2310.17022v1 [cs.LG])
    We propose controlled decoding (CD), a novel off-policy reinforcement learning method to control the autoregressive generation from language models towards high reward outcomes. CD solves an off-policy reinforcement learning problem through a value function for the reward, which we call a prefix scorer. The prefix scorer is used at inference time to steer the generation towards higher reward outcomes. We show that the prefix scorer may be trained on (possibly) off-policy data to predict the expected reward when decoding is continued from a partially decoded response. We empirically demonstrate that CD is effective as a control mechanism on Reddit conversations corpus. We also show that the modularity of the design of CD makes it possible to control for multiple rewards, effectively solving a multi-objective reinforcement learning problem with no additional complexity. Finally, we show that CD can be applied in a novel blockwise fashion at inference-time, again without the need for any training-time changes, essentially bridging the gap between the popular best-of-$K$ strategy and token-level reinforcement learning. This makes CD a promising approach for alignment of language models.
    Towards Matching Phones and Speech Representations. (arXiv:2310.17558v1 [cs.CL])
    Learning phone types from phone instances has been a long-standing problem, while still being open. In this work, we revisit this problem in the context of self-supervised learning, and pose it as the problem of matching cluster centroids to phone embeddings. We study two key properties that enable matching, namely, whether cluster centroids of self-supervised representations reduce the variability of phone instances and respect the relationship among phones. We then use the matching result to produce pseudo-labels and introduce a new loss function for improving self-supervised representations. Our experiments show that the matching result captures the relationship among phones. Training the new loss function jointly with the regular self-supervised losses, such as APC and CPC, significantly improves the downstream phone classification.
    Lithium Metal Battery Quality Control via Transformer-CNN Segmentation. (arXiv:2302.04824v2 [cs.CV] UPDATED)
    The lithium metal battery (LMB) has the potential to be the next-generation battery system because of its high theoretical energy density. However, defects known as dendrites are formed by heterogeneous lithium (Li) plating, which hinders the development and utilization of LMBs. Non-destructive techniques to observe the dendrite morphology often use X-ray computed tomography (XCT) to provide cross-sectional views. To retrieve three-dimensional structures inside a battery, image segmentation becomes essential to quantitatively analyze XCT images. This work proposes a new semantic segmentation approach using a transformer-based neural network called TransforCNN that is capable of segmenting out dendrites from XCT data. In addition, we compare the performance of the proposed TransforCNN with three other algorithms, U-Net, Y-Net, and E-Net, consisting of an Ensemble Network model for XCT analysis. Our results show the advantages of using TransforCNN when evaluated on segmentation metrics, such as mean Intersection over Union (mIoU) and mean Dice Similarity Coefficient (mDSC), as well as through several qualitative comparative visualizations.
    Statistically Valid Variable Importance Assessment through Conditional Permutations. (arXiv:2309.07593v2 [cs.LG] UPDATED)
    Variable importance assessment has become a crucial step in machine-learning applications when using complex learners, such as deep neural networks, on large-scale data. Removal-based importance assessment is currently the reference approach, particularly when statistical guarantees are sought to justify variable inclusion. It is often implemented with variable permutation schemes. On the flip side, these approaches risk misidentifying unimportant variables as important in the presence of correlations among covariates. Here we develop a systematic approach for studying Conditional Permutation Importance (CPI) that is model agnostic and computationally lean, as well as reusable benchmarks of state-of-the-art variable importance estimators. We show theoretically and empirically that CPI overcomes the limitations of standard permutation importance by providing accurate type-I error control. When used with a deep neural network, CPI consistently showed top accuracy across benchmarks. An experiment on real-world data analysis in a large-scale medical dataset showed that CPI provides a more parsimonious selection of statistically significant variables. Our results suggest that CPI can be readily used as a drop-in replacement for permutation-based methods.
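    The idea of permuting a variable conditionally on the others can be sketched as follows. This is a simplified illustration, not the authors' estimator: it assumes a linear model for predicting feature `j` from the remaining features, a squared-error loss, and a user-supplied `predict` callable, all of which are choices made for this example.

```python
import numpy as np


def conditional_permutation_importance(predict, X, y, j, rng, n_perm=20):
    """Mean increase in squared error when only the part of X[:, j] that is
    not explained by the other columns is shuffled."""
    others = np.delete(X, j, axis=1)
    A = np.column_stack([others, np.ones(len(X))])
    # Assumed conditional model: least-squares fit of x_j on the other features.
    coef, *_ = np.linalg.lstsq(A, X[:, j], rcond=None)
    x_hat = A @ coef
    resid = X[:, j] - x_hat
    base = np.mean((y - predict(X)) ** 2)
    gains = []
    for _ in range(n_perm):
        Xp = X.copy()
        # Permute residuals only, so correlations with other features survive.
        Xp[:, j] = x_hat + rng.permutation(resid)
        gains.append(np.mean((y - predict(Xp)) ** 2) - base)
    return float(np.mean(gains))
```

    Shuffling residuals instead of the raw column keeps the permuted data on the joint distribution of the covariates, which is what stops a correlated but irrelevant variable from looking important. In practice the importance would be computed on held-out data and paired with a significance test.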
    Looping in the Human: Collaborative and Explainable Bayesian Optimization. (arXiv:2310.17273v1 [cs.LG])
    Like many optimizers, Bayesian optimization often falls short of gaining user trust due to opacity. While attempts have been made to develop human-centric optimizers, they typically assume user knowledge is well-specified and error-free, employing users mainly as supervisors of the optimization process. We relax these assumptions and propose a more balanced human-AI partnership with our Collaborative and Explainable Bayesian Optimization (CoExBO) framework. Instead of explicitly requiring a user to provide a knowledge model, CoExBO employs preference learning to seamlessly integrate human insights into the optimization, resulting in algorithmic suggestions that resonate with user preference. CoExBO explains its candidate selection every iteration to foster trust, empowering users with a clearer grasp of the optimization. Furthermore, CoExBO offers a no-harm guarantee, allowing users to make mistakes; even with extreme adversarial interventions, the algorithm converges asymptotically to a vanilla Bayesian optimization. We validate CoExBO's efficacy through human-AI teaming experiments in lithium-ion battery design, highlighting substantial improvements over conventional methods.
    Neuro-Inspired Fragmentation and Recall to Overcome Catastrophic Forgetting in Curiosity. (arXiv:2310.17537v1 [cs.AI])
    Deep reinforcement learning methods exhibit impressive performance on a range of tasks but still struggle on hard exploration tasks in large environments with sparse rewards. To address this, intrinsic rewards can be generated using forward model prediction errors that decrease as the environment becomes known, and incentivize an agent to explore novel states. While prediction-based intrinsic rewards can help agents solve hard exploration tasks, they can suffer from catastrophic forgetting and actually increase at visited states. We first examine the conditions and causes of catastrophic forgetting in grid world environments. We then propose a new method FARCuriosity, inspired by how humans and animals learn. The method depends on fragmentation and recall: an agent fragments an environment based on surprisal, and uses different local curiosity modules (prediction-based intrinsic reward functions) for each fragment so that modules are not trained on the entire environment. At each fragmentation event, the agent stores the current module in long-term memory (LTM) and either initializes a new module or recalls a previously stored module based on its match with the current state. With fragmentation and recall, FARCuriosity achieves less forgetting and better overall performance in games with varied and heterogeneous environments in the Atari benchmark suite of tasks. Thus, this work highlights the problem of catastrophic forgetting in prediction-based curiosity methods and proposes a solution.
    Efficient Numerical Algorithm for Large-Scale Damped Natural Gradient Descent. (arXiv:2310.17556v1 [cs.LG])
    We propose a new algorithm for efficiently solving the damped Fisher matrix in large-scale scenarios where the number of parameters significantly exceeds the number of available samples. This problem is fundamental for natural gradient descent and stochastic reconfiguration. Our algorithm is based on Cholesky decomposition and is generally applicable. Benchmark results show that the algorithm is significantly faster than existing methods.
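    The abstract does not spell out the algorithm, so the following is only a sketch of one standard way to exploit the regime where parameters far outnumber samples: the push-through (Woodbury-type) identity moves the factorization from the $p \times p$ damped Fisher matrix to an $n \times n$ matrix, which is then factored with Cholesky. The function name and the definition of the Fisher matrix as $J^\top J$ (with any $1/n$ scaling absorbed into `J`) are assumptions made for this example.

```python
import numpy as np


def damped_fisher_solve(J, g, damping):
    """Solve (J^T J + damping * I) x = g for J of shape (n, p) with n << p.

    Uses (J^T J + d I)^{-1} g = (g - J^T (J J^T + d I)^{-1} J g) / d,
    so only an n x n matrix is ever factored.
    """
    n = J.shape[0]
    K = J @ J.T + damping * np.eye(n)  # small n x n symmetric positive definite
    L = np.linalg.cholesky(K)          # K = L L^T
    # Two solves with the Cholesky factors; a triangular solver would be faster.
    z = np.linalg.solve(L, J @ g)
    z = np.linalg.solve(L.T, z)
    return (g - J.T @ z) / damping
```

    The cost is dominated by forming `J @ J.T`, which is $O(n^2 p)$, instead of the $O(p^3)$ needed to factor the full damped matrix directly.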
    Fine-tuning Global Model via Data-Free Knowledge Distillation for Non-IID Federated Learning. (arXiv:2203.09249v2 [cs.LG] UPDATED)
    Federated Learning (FL) is an emerging distributed learning paradigm under privacy constraints. Data heterogeneity is one of the main challenges in FL, which results in slow convergence and degraded performance. Most existing approaches only tackle the heterogeneity challenge by restricting the local model update on the client, ignoring the performance drop caused by direct global model aggregation. Instead, we propose a data-free knowledge distillation method to fine-tune the global model on the server (FedFTG), which relieves the issue of direct model aggregation. Concretely, FedFTG explores the input space of local models through a generator, and uses it to transfer the knowledge from local models to the global model. Besides, we propose a hard sample mining scheme to achieve effective knowledge distillation throughout the training. In addition, we develop customized label sampling and class-level ensemble to make maximum use of knowledge, which implicitly mitigates the distribution discrepancy across clients. Extensive experiments show that our FedFTG significantly outperforms the state-of-the-art (SOTA) FL algorithms and can serve as a strong plugin for enhancing FedAvg, FedProx, FedDyn, and SCAFFOLD.
    Read and Reap the Rewards: Learning to Play Atari with the Help of Instruction Manuals. (arXiv:2302.04449v3 [cs.LG] UPDATED)
    High sample complexity has long been a challenge for RL. On the other hand, humans learn to perform tasks not only from interaction or demonstrations, but also by reading unstructured text documents, e.g., instruction manuals. Instruction manuals and wiki pages are among the most abundant data that could inform agents of valuable features and policies or task-specific environmental dynamics and reward structures. Therefore, we hypothesize that the ability to utilize human-written instruction manuals to assist learning policies for specific tasks should lead to a more efficient and better-performing agent. We propose the Read and Reward framework. Read and Reward speeds up RL algorithms on Atari games by reading manuals released by the Atari game developers. Our framework consists of a QA Extraction module that extracts and summarizes relevant information from the manual and a Reasoning module that evaluates object-agent interactions based on information from the manual. An auxiliary reward is then provided to a standard A2C RL agent, when interaction is detected. Experimentally, various RL algorithms obtain significant improvement in performance and training speed when assisted by our design.
    SoK: Pitfalls in Evaluating Black-Box Attacks. (arXiv:2310.17534v1 [cs.CR])
    Numerous works study black-box attacks on image classifiers. However, these works make different assumptions on the adversary's knowledge and the current literature lacks a cohesive organization centered around the threat model. To systematize knowledge in this area, we propose a taxonomy over the threat space spanning the axes of feedback granularity, access to interactive queries, and the quality and quantity of the auxiliary data available to the attacker. Our new taxonomy provides three key insights. 1) Despite extensive literature, numerous under-explored threat spaces exist, which cannot be trivially solved by adapting techniques from well-explored settings. We demonstrate this by establishing a new state-of-the-art in the less-studied setting of access to top-k confidence scores by adapting techniques from well-explored settings of accessing the complete confidence vector, but show how it still falls short of the more restrictive setting that only obtains the prediction label, highlighting the need for more research. 2) Identifying the threat model of different attacks uncovers stronger baselines that challenge prior state-of-the-art claims. We demonstrate this by enhancing an initially weaker baseline (under interactive query access) via surrogate models, effectively overturning claims in the respective paper. 3) Our taxonomy reveals interactions between attacker knowledge that connect well to related areas, such as model inversion and extraction attacks. We discuss how advances in other areas can enable potentially stronger black-box attacks. Finally, we emphasize the need for a more realistic assessment of attack success by factoring in local attack runtime. This approach reveals the potential for certain attacks to achieve notably higher success rates and the need to evaluate attacks in diverse and harder settings, highlighting the need for better selection criteria.
    Little Exploration is All You Need. (arXiv:2310.17538v1 [cs.LG])
    The prevailing principle of "Optimism in the Face of Uncertainty" advocates for the incorporation of an exploration bonus, generally assumed to be proportional to the inverse square root of the visit count ($1/\sqrt{n}$), where $n$ is the number of visits to a particular state-action pair. This approach, however, exclusively focuses on "uncertainty," neglecting the inherent "difficulty" of different options. To address this gap, we introduce a novel modification of the standard UCB algorithm in the multi-armed bandit problem, proposing an adjusted bonus term of $1/n^\tau$, where $\tau > 1/2$, that accounts for task difficulty. Our proposed algorithm, denoted as UCB$^\tau$, is substantiated through comprehensive regret and risk analyses, confirming its theoretical robustness. Comparative evaluations with standard UCB and Thompson Sampling algorithms on synthetic datasets demonstrate that UCB$^\tau$ not only outperforms in efficacy but also exhibits lower risk across various environmental conditions and hyperparameter settings.
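    A bandit loop with the $1/n^\tau$ bonus can be sketched as below. This is a minimal illustration under assumptions, not the paper's exact UCB$^\tau$: the index is taken to be the empirical mean plus `scale / n**tau`, rewards are Bernoulli, and the `scale` constant, the initial round-robin pulls, and the function name are choices made for this example.

```python
import random


def ucb_tau(arm_means, horizon, tau=0.75, scale=1.0, seed=0):
    """Run a Bernoulli bandit with index mean_i + scale / n_i**tau.

    Returns the number of times each arm was pulled.
    """
    rng = random.Random(seed)
    k = len(arm_means)
    counts = [0] * k
    sums = [0.0] * k
    for t in range(horizon):
        if t < k:
            arm = t  # pull every arm once so each count is positive
        else:
            arm = max(
                range(k),
                key=lambda i: sums[i] / counts[i] + scale / counts[i] ** tau,
            )
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
    return counts
```

    With $\tau > 1/2$ the bonus decays faster than the usual $1/\sqrt{n}$ term, so the policy commits to the empirically best arm sooner; calling `ucb_tau([0.2, 0.5, 0.8], horizon=1000)` returns the per-arm pull counts for one simulated run.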
    Controllable Generation of Artificial Speaker Embeddings through Discovery of Principal Directions. (arXiv:2310.17502v1 [cs.SD])
    Customizing voice and speaking style in a speech synthesis system with intuitive and fine-grained controls is challenging, given that little data with appropriate labels is available. Furthermore, editing an existing human's voice also comes with ethical concerns. In this paper, we propose a method to generate artificial speaker embeddings that cannot be linked to a real human while offering intuitive and fine-grained control over the voice and speaking style of the embeddings, without requiring any labels for speaker or style. The artificial and controllable embeddings can be fed to a speech synthesis system, conditioned on embeddings of real humans during training, without sacrificing privacy during inference.
    Cross-feature Contrastive Loss for Decentralized Deep Learning on Heterogeneous Data. (arXiv:2310.15890v2 [cs.LG] UPDATED)
    The current state-of-the-art decentralized learning algorithms mostly assume the data distribution to be Independent and Identically Distributed (IID). However, in practical scenarios, the distributed datasets can have significantly heterogeneous data distributions across the agents. In this work, we present a novel approach for decentralized learning on heterogeneous data, where data-free knowledge distillation through contrastive loss on cross-features is utilized to improve performance. Cross-features for a pair of neighboring agents are the features (i.e., last hidden layer activations) obtained from the data of an agent with respect to the model parameters of the other agent. We demonstrate the effectiveness of the proposed technique through an exhaustive set of experiments on various Computer Vision datasets (CIFAR-10, CIFAR-100, Fashion MNIST, Imagenette, and ImageNet), model architectures, and network topologies. Our experiments show that the proposed method achieves superior performance (0.2-4% improvement in test accuracy) compared to other existing techniques for decentralized learning on heterogeneous data.
    Improving Few-Shot Learning through Multi-task Representation Learning Theory. (arXiv:2010.01992v3 [cs.LG] CROSS LISTED)
    In this paper, we consider the framework of multi-task representation (MTR) learning where the goal is to use source tasks to learn a representation that reduces the sample complexity of solving a target task. We start by reviewing recent advances in MTR theory and show that they can provide novel insights for popular meta-learning algorithms when analyzed within this framework. In particular, we highlight a fundamental difference between gradient-based and metric-based algorithms in practice and put forward a theoretical analysis to explain it. Finally, we use the derived insights to improve the performance of meta-learning methods via a new spectral-based regularization term and confirm its efficiency through experimental studies on few-shot classification benchmarks. To the best of our knowledge, this is the first contribution that puts the most recent learning bounds of MTR theory into practice for the task of few-shot classification.
    StochGradAdam: Accelerating Neural Networks Training with Stochastic Gradient Sampling. (arXiv:2310.17042v1 [cs.LG])
    In the rapidly advancing domain of deep learning optimization, this paper unveils the StochGradAdam optimizer, a novel adaptation of the well-regarded Adam algorithm. Central to StochGradAdam is its gradient sampling technique. This method not only ensures stable convergence but also leverages the advantages of selective gradient consideration, fostering robust training by potentially mitigating the effects of noisy or outlier data and enhancing the exploration of the loss landscape for more dependable convergence. In both image classification and segmentation tasks, StochGradAdam has demonstrated superior performance compared to the traditional Adam optimizer. By judiciously sampling a subset of gradients at each iteration, the optimizer is optimized for managing intricate models. The paper provides a comprehensive exploration of StochGradAdam's methodology, from its mathematical foundations to bias correction strategies, heralding a promising advancement in deep learning training techniques.
    FedPEAT: Convergence of Federated Learning, Parameter-Efficient Fine Tuning, and Emulator Assisted Tuning for Artificial Intelligence Foundation Models with Mobile Edge Computing. (arXiv:2310.17491v1 [cs.LG])
    The emergence of foundation models, including language and vision models, has reshaped AI's landscape, offering capabilities across various applications. Deploying and fine-tuning these large models, like GPT-3 and BERT, presents challenges, especially in the current foundation model era. We introduce Emulator-Assisted Tuning (EAT) combined with Parameter-Efficient Fine-Tuning (PEFT) to form Parameter-Efficient Emulator-Assisted Tuning (PEAT). Further, we expand this into federated learning as Federated PEAT (FedPEAT). FedPEAT uses adapters, emulators, and PEFT for federated model tuning, enhancing model privacy and memory efficiency. Adapters adjust pre-trained models, while emulators give a compact representation of original models, addressing both privacy and efficiency. Adaptable to various neural networks, our approach also uses deep reinforcement learning for hyper-parameter optimization. We tested FedPEAT in a unique scenario with a server participating in collaborative federated tuning, showcasing its potential in tackling foundation model challenges.
    IDENAS: Internal Dependency Exploration for Neural Architecture Search. (arXiv:2310.17250v1 [cs.LG])
    Machine learning is a powerful tool for extracting valuable information and making various predictions from diverse datasets. Traditional algorithms rely on well-defined input and output variables; however, there are scenarios where the distinction between the input and output variables, and between the associated input and output layers of the model, is unknown. Neural Architecture Search (NAS) and Feature Selection have emerged as promising solutions in such scenarios. This research proposes IDENAS, an Internal Dependency-based Exploration for Neural Architecture Search, integrating NAS with feature selection. The methodology explores internal dependencies in the complete parameter space for classification involving both 1D sensor and 2D image data. IDENAS employs a modified encoder-decoder model and the Sequential Forward Search (SFS) algorithm, combining input-output configuration search with embedded feature selection. Experimental results demonstrate IDENAS's superior performance in comparison to other algorithms, showcasing its effectiveness in model development pipelines and automated machine learning. On average, IDENAS achieved significant modelling improvements, underscoring its contribution to advancing the state-of-the-art in neural architecture search and feature selection integration.
    Diagnosing Alzheimer's Disease using Early-Late Multimodal Data Fusion with Jacobian Maps. (arXiv:2310.16936v1 [cs.CV])
    Alzheimer's disease (AD) is a prevalent and debilitating neurodegenerative disorder impacting a large aging population. Detecting AD in all its presymptomatic and symptomatic stages is crucial for early intervention and treatment. An active research direction is to explore machine learning methods that harness multimodal data fusion to outperform human inspection of medical scans. However, existing multimodal fusion models have limitations, including redundant computation, complex architecture, and simplistic handling of missing data. Moreover, the preprocessing pipelines of medical scans remain inadequately detailed and are seldom optimized for individual subjects. In this paper, we propose an efficient early-late fusion (ELF) approach, which leverages a convolutional neural network for automated feature extraction and random forests for their competitive performance on small datasets. Additionally, we introduce a robust preprocessing pipeline that adapts to the unique characteristics of individual subjects and makes use of whole brain images rather than slices or patches. Moreover, to tackle the challenge of detecting subtle changes in brain volume, we transform images into the Jacobian domain (JD) to enhance both accuracy and robustness in our classification. Using MRI and CT images from the OASIS-3 dataset, our experiments demonstrate the effectiveness of the ELF approach in classifying AD into four stages with an accuracy of 97.19%.
    Enhancing Energy-efficiency by Solving the Throughput Bottleneck of LSTM Cells for Embedded FPGAs. (arXiv:2310.16842v1 [cs.AR])
    To process sensor data in the Internet of Things (IoT), embedded deep learning for 1-dimensional data is an important technique. In the past, CNNs were frequently used because they are simple to optimise for special embedded hardware such as FPGAs. This work proposes a novel LSTM cell optimisation aimed at energy-efficient inference on end devices. Using traffic speed prediction as a case study, a vanilla LSTM model with the optimised LSTM cell achieves 17534 inferences per second while consuming only 3.8 $\mu$J per inference on the XC7S15 FPGA from the Spartan-7 family. It achieves at least 5.4$\times$ higher throughput and is 1.37$\times$ more energy efficient than existing approaches.
    Pre-RMSNorm and Pre-CRMSNorm Transformers: Equivalent and Efficient Pre-LN Transformers. (arXiv:2305.14858v2 [cs.LG] UPDATED)
    Transformers have achieved great success in machine learning applications. Normalization techniques, such as Layer Normalization (LayerNorm, LN) and Root Mean Square Normalization (RMSNorm), play a critical role in accelerating and stabilizing the training of Transformers. While LayerNorm recenters and rescales input vectors, RMSNorm only rescales the vectors by their RMS value. Despite being more computationally efficient, RMSNorm may compromise the representation ability of Transformers. There is currently no consensus regarding the preferred normalization technique, as some models employ LayerNorm while others utilize RMSNorm, especially in recent large language models. It is challenging to convert Transformers with one normalization to the other type. While there is an ongoing disagreement between the two normalization types, we propose a solution to unify two mainstream Transformer architectures, Pre-LN and Pre-RMSNorm Transformers. By removing the inherent redundant mean information in the main branch of Pre-LN Transformers, we can reduce LayerNorm to RMSNorm, achieving higher efficiency. We further propose the Compressed RMSNorm (CRMSNorm) and Pre-CRMSNorm Transformer based on a lossless compression of the zero-mean vectors. We formally establish the equivalence of Pre-LN, Pre-RMSNorm, and Pre-CRMSNorm Transformer variants in both training and inference. It implies that Pre-LN Transformers can be substituted with Pre-(C)RMSNorm counterparts at almost no cost, offering the same arithmetic functionality along with free efficiency improvement. Experiments demonstrate that we can reduce the training and inference time of Pre-LN Transformers by 1% - 10%.
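The reduction from LayerNorm to RMSNorm described above rests on a simple identity: once a vector has zero mean, rescaling by its standard deviation and rescaling by its RMS are the same operation. The following is a minimal sketch of that identity only, in plain Python, with no learnable gain or bias; the function names `layer_norm`, `rms_norm`, and `recenter` are illustrative and do not come from the paper.

```python
import math

def recenter(x):
    """Subtract the mean so the vector has zero mean."""
    mean = sum(x) / len(x)
    return [v - mean for v in x]

def layer_norm(x):
    """LayerNorm without gain/bias: recenter, then divide by the standard deviation."""
    centered = recenter(x)
    var = sum(c * c for c in centered) / len(x)
    return [c / math.sqrt(var) for c in centered]

def rms_norm(x):
    """RMSNorm without gain: divide by the root mean square, no recentering."""
    rms = math.sqrt(sum(v * v for v in x) / len(x))
    return [v / rms for v in x]

x = [2.0, -1.0, 0.5, 4.5]          # arbitrary non-constant vector with non-zero mean
ln_out = layer_norm(x)             # LayerNorm on the raw vector
rms_out = rms_norm(recenter(x))    # RMSNorm on the zero-mean vector
max_gap = max(abs(a - b) for a, b in zip(ln_out, rms_out))
```

Here `max_gap` is zero up to floating-point rounding, whereas `rms_norm(x)` applied to the raw vector differs from `ln_out`, which is why the mean information has to be removed first.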
    Can Differentiable Decision Trees Learn Interpretable Reward Functions?. (arXiv:2306.13004v3 [cs.LG] UPDATED)
    There is an increasing interest in learning reward functions that model human preferences. However, many frameworks use blackbox learning methods that, while expressive, are difficult to interpret. We propose and evaluate a novel approach for learning expressive and interpretable reward functions from preferences using Differentiable Decision Trees (DDTs). Our experiments across several domains, including CartPole, Visual Gridworld environments and Atari games, provide evidence that the tree structure of our learned reward function is useful in determining the extent to which the reward function is aligned with human preferences. We provide experimental evidence that reward DDTs can achieve competitive performance when compared with larger capacity deep neural network reward functions. We also observe that the choice between soft and hard (argmax) output of reward DDT reveals a tension between wanting highly shaped rewards to ensure good RL performance, while also wanting simpler, more interpretable rewards.
    Sky Imager-Based Forecast of Solar Irradiance Using Machine Learning. (arXiv:2310.17356v1 [cs.CV])
    Ahead-of-time forecasting of the output power of power plants is essential for the stability of the electricity grid and ensuring uninterrupted service. However, forecasting renewable energy sources is difficult due to the chaotic behavior of natural energy sources. This paper presents a new approach to estimate short-term solar irradiance from sky images. The proposed algorithm extracts features from sky images and uses learning-based techniques to estimate the solar irradiance. The performance of the proposed machine learning (ML) algorithm is evaluated using two publicly available datasets of sky images. The datasets contain over 350,000 images for an interval of 16 years, from 2004 to 2020, with the corresponding global horizontal irradiance (GHI) of each image as the ground truth. Compared to the state-of-the-art computationally heavy algorithms proposed in the literature, our approach achieves competitive results with much less computational complexity for both nowcasting and forecasting up to 4 h ahead of time.
    Unleashing the potential of GNNs via Bi-directional Knowledge Transfer. (arXiv:2310.17132v1 [cs.LG])
    Based on the message-passing paradigm, a considerable amount of research has proposed diverse and impressive feature propagation mechanisms to improve the performance of GNNs. However, less focus has been put on feature transformation, another major operation of the message-passing framework. In this paper, we first empirically investigate the performance of the feature transformation operation in several typical GNNs. Unexpectedly, we notice that GNNs do not completely free up the power of the inherent feature transformation operation. Based on this observation, we propose the Bi-directional Knowledge Transfer (BiKT), a plug-and-play approach to unleash the potential of the feature transformation operations without modifying the original architecture. Taking the feature transformation operation as a derived representation learning model that shares parameters with the original GNN, the direct prediction by this model provides topology-agnostic knowledge feedback that can further instruct the learning of the GNN and the feature transformations therein. On this basis, BiKT not only allows us to acquire knowledge from both the GNN and its derived model but also lets each improve the other by injecting its knowledge into the other. In addition, a theoretical analysis is provided to demonstrate that BiKT improves the generalization bound of the GNNs from the perspective of domain adaptation. An extensive group of experiments on up to 7 datasets with 5 typical GNNs demonstrates that BiKT brings a 0.5% - 4% performance gain over the original GNN, which means a boosted GNN is obtained. Meanwhile, the derived model also shows a powerful performance to compete with or even surpass the original GNN, enabling us to flexibly apply it independently to some other specific downstream tasks.
    Effective Targeted Attacks for Adversarial Self-Supervised Learning. (arXiv:2210.10482v2 [cs.LG] UPDATED)
    Recently, unsupervised adversarial training (AT) has been highlighted as a means of achieving robustness in models without any label information. Previous studies in unsupervised AT have mostly focused on implementing self-supervised learning (SSL) frameworks, which maximize the instance-wise classification loss to generate adversarial examples. However, we observe that simply maximizing the self-supervised training loss with an untargeted adversarial attack often results in generating ineffective adversaries that may not help improve the robustness of the trained model, especially for non-contrastive SSL frameworks without negative examples. To tackle this problem, we propose a novel positive mining for targeted adversarial attack to generate effective adversaries for adversarial SSL frameworks. Specifically, we introduce an algorithm that selects the most confusing yet similar target example for a given instance based on entropy and similarity, and subsequently perturbs the given instance towards the selected target. Our method demonstrates significant enhancements in robustness when applied to non-contrastive SSL frameworks, and less but consistent robustness improvements with contrastive SSL frameworks, on the benchmark datasets.
    Learning Space-Time Continuous Neural PDEs from Partially Observed States. (arXiv:2307.04110v2 [cs.LG] UPDATED)
    We introduce a novel grid-independent model for learning partial differential equations (PDEs) from noisy and partial observations on irregular spatiotemporal grids. We propose a space-time continuous latent neural PDE model with an efficient probabilistic framework and a novel encoder design for improved data efficiency and grid independence. The latent state dynamics are governed by a PDE model that combines the collocation method and the method of lines. We employ amortized variational inference for approximate posterior estimation and utilize a multiple shooting technique for enhanced training speed and stability. Our model demonstrates state-of-the-art performance on complex synthetic and real-world datasets, overcoming limitations of previous approaches and effectively handling partially-observed data. The proposed model outperforms recent methods, showing its potential to advance data-driven PDE modeling and enabling robust, grid-independent modeling of complex partially-observed dynamic processes.
    Learning an Inventory Control Policy with General Inventory Arrival Dynamics. (arXiv:2310.17168v1 [cs.LG])
    In this paper we address the problem of learning and backtesting inventory control policies in the presence of general arrival dynamics -- which we term as a quantity-over-time arrivals model (QOT). We also allow for order quantities to be modified as a post-processing step to meet vendor constraints such as order minimum and batch size constraints -- a common practice in real supply chains. To the best of our knowledge this is the first work to handle either arbitrary arrival dynamics or an arbitrary downstream post-processing of order quantities. Building upon recent work (Madeka et al., 2022) we similarly formulate the periodic review inventory control problem as an exogenous decision process, where most of the state is outside the control of the agent. Madeka et al. (2022) show how to construct a simulator that replays historic data to solve this class of problem. In our case, we incorporate a deep generative model for the arrivals process as part of the history replay. By formulating the problem as an exogenous decision process, we can apply results from Madeka et al. (2022) to obtain a reduction to supervised learning. Finally, we show via simulation studies that this approach yields statistically significant improvements in profitability over production baselines. Using data from an ongoing real-world A/B test, we show that Gen-QOT generalizes well to off-policy data.
    A Theory of Link Prediction via Relational Weisfeiler-Leman on Knowledge Graphs. (arXiv:2302.02209v4 [cs.LG] UPDATED)
    Graph neural networks are prominent models for representation learning over graph-structured data. While the capabilities and limitations of these models are well-understood for simple graphs, our understanding remains incomplete in the context of knowledge graphs. Our goal is to provide a systematic understanding of the landscape of graph neural networks for knowledge graphs pertaining to the prominent task of link prediction. Our analysis entails a unifying perspective on seemingly unrelated models and unlocks a series of other models. The expressive power of various models is characterized via a corresponding relational Weisfeiler-Leman algorithm. This analysis is extended to provide a precise logical characterization of the class of functions captured by a class of graph neural networks. The theoretical findings presented in this paper explain the benefits of some widely employed practical design choices, which are validated empirically.
    Exploring the Trie of Rules: a fast data structure for the representation of association rules. (arXiv:2310.17355v1 [cs.LG])
    Association rule mining techniques can generate a large volume of sequential data when implemented on transactional databases. Extracting insights from a large set of association rules has been found to be a challenging process. When examining a ruleset, the fundamental question is how to summarise and represent meaningful mined knowledge efficiently. Many algorithms and strategies have been developed to address the issue of knowledge extraction; however, the effectiveness of this process can be limited by the data structures. A better data structure can significantly affect the speed of the knowledge extraction process. This paper proposes a novel data structure, called the Trie of rules, for storing a ruleset that is generated by association rule mining. The resulting data structure is a prefix-tree graph structure made of pre-mined rules. This graph stores the rules as paths within the prefix-tree in a way that similar rules overlay each other. Each node in the tree represents a rule where a consequent is this node, and an antecedent is a path from this node to the root of the tree. The evaluation showed that the proposed representation technique is promising. It compresses a ruleset with almost no data loss and benefits in terms of time for basic operations such as searching for a specific rule and sorting, which is the base for many knowledge discovery methods. Moreover, our method demonstrated a significant improvement in traversing time, achieving an 8-fold increase compared to traditional data structures.
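The node-as-rule layout described above can be sketched as a small prefix tree. This is an illustrative sketch, not the authors' implementation: the class names `RuleNode` and `TrieOfRules`, the `metrics` payload, and the choice to order antecedent items alphabetically are all assumptions made so that rules sharing antecedent items share a path.

```python
class RuleNode:
    """One item on a path; if `metrics` is set, the node is a stored rule."""
    def __init__(self, item=None):
        self.item = item
        self.children = {}
        self.metrics = None

class TrieOfRules:
    def __init__(self):
        self.root = RuleNode()

    def _path(self, antecedent, consequent):
        # Assumption: a fixed (alphabetical) item order makes similar rules overlap.
        return sorted(antecedent) + [consequent]

    def insert(self, antecedent, consequent, metrics):
        node = self.root
        for item in self._path(antecedent, consequent):
            if item not in node.children:
                node.children[item] = RuleNode(item)
            node = node.children[item]
        node.metrics = metrics  # consequent = this node, antecedent = path above it

    def find(self, antecedent, consequent):
        node = self.root
        for item in self._path(antecedent, consequent):
            node = node.children.get(item)
            if node is None:
                return None
        return node.metrics

    def node_count(self):
        count, stack = 0, [self.root]
        while stack:
            node = stack.pop()
            count += len(node.children)
            stack.extend(node.children.values())
        return count

trie = TrieOfRules()
trie.insert(["bread", "butter"], "milk", {"confidence": 0.8})
trie.insert(["bread", "butter"], "jam", {"confidence": 0.6})
trie.insert(["bread"], "butter", {"confidence": 0.7})
```

The three example rules occupy four nodes in total because they share the `bread` and `butter` prefix, and looking up a rule costs one dictionary access per item rather than a scan over the whole ruleset.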
    Universal Test-time Adaptation through Weight Ensembling, Diversity Weighting, and Prior Correction. (arXiv:2306.00650v2 [cs.CV] UPDATED)
    Since distribution shifts are likely to occur during test-time and can drastically decrease the model's performance, online test-time adaptation (TTA) continues to update the model after deployment, leveraging the current test data. Clearly, a method proposed for online TTA has to perform well for all kinds of environmental conditions. By introducing the variable factors domain non-stationarity and temporal correlation, we first unfold all practically relevant settings and define the entity as universal TTA. We want to highlight that this is the first work that covers such a broad spectrum, which is indispensable for the use in practice. To tackle the problem of universal TTA, we identify and highlight several challenges a self-training based method has to deal with: 1) model bias and the occurrence of trivial solutions when performing entropy minimization on varying sequence lengths with and without multiple domain shifts, 2) loss of generalization which exacerbates the adaptation to multiple domain shifts and the occurrence of catastrophic forgetting, and 3) performance degradation due to shifts in class prior. To prevent the model from becoming biased, we leverage a dataset and model-agnostic certainty and diversity weighting. In order to maintain generalization and prevent catastrophic forgetting, we propose to continually weight-average the source and adapted model. To compensate for disparities in the class prior during test-time, we propose an adaptive prior correction scheme that reweights the model's predictions. We evaluate our approach, named ROID, on a wide range of settings, datasets, and models, setting new standards in the field of universal TTA. Code is available at: https://github.com/mariodoebler/test-time-adaptation
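The idea of continually weight-averaging the source and adapted model can be sketched as follows. The abstract does not give the update rule, so the exponential-moving-average form, the `momentum` parameter, and its default value are assumptions; the ROID repository linked above is the authoritative reference.

```python
def weight_ensemble(adapted, source, momentum=0.99):
    """Pull every adapted parameter slightly back toward the source model.

    `adapted` and `source` map parameter names to flat lists of floats.
    The EMA form and the default `momentum` are assumptions of this sketch.
    """
    return {
        name: [momentum * a + (1.0 - momentum) * s
               for a, s in zip(adapted[name], source[name])]
        for name in adapted
    }

source = {"w": [1.0, 2.0]}
adapted = {"w": [2.0, 0.0]}   # parameters after some hypothetical adaptation steps

# With no further adaptation, repeated averaging drifts back to the source weights.
restored = adapted
for _ in range(2000):
    restored = weight_ensemble(restored, source)
```

In an actual adaptation loop the averaging step would follow each gradient update, so the model stays close enough to the source weights to retain generalization while still tracking the test data.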
    STEER: Semantic Turn Extension-Expansion Recognition for Voice Assistants. (arXiv:2310.16990v1 [cs.CL])
    In the context of a voice assistant system, steering refers to the phenomenon in which a user issues a follow-up command attempting to direct or clarify a previous turn. We propose STEER, a steering detection model that predicts whether a follow-up turn is a user's attempt to steer the previous command. Constructing a training dataset for steering use cases poses challenges due to the cold-start problem. To overcome this, we developed heuristic rules to sample opt-in usage data, approximating positive and negative samples without any annotation. Our experimental results show promising performance in identifying steering intent, with over 95% accuracy on our sampled data. Moreover, STEER, in conjunction with our sampling strategy, aligns effectively with real-world steering scenarios, as evidenced by its strong zero-shot performance on a human-graded evaluation set. In addition to relying solely on user transcripts as input, we introduce STEER+, an enhanced version of the model. STEER+ utilizes a semantic parse tree to provide more context on out-of-vocabulary words, such as named entities that often occur at the sentence boundary. This further improves model performance, reducing error rate in domains where entities frequently appear, such as messaging. Lastly, we present a data analysis that highlights the improvement in user experience when voice assistants support steering use cases.
    A Deep Learning Approach to Teeth Segmentation and Orientation from Panoramic X-rays. (arXiv:2310.17176v1 [cs.CV])
    Accurate teeth segmentation and orientation are fundamental in modern oral healthcare, enabling precise diagnosis, treatment planning, and dental implant design. In this study, we present a comprehensive approach to teeth segmentation and orientation from panoramic X-ray images, leveraging deep learning techniques. We build our model based on FUSegNet, a popular model originally developed for wound segmentation, and introduce modifications by incorporating grid-based attention gates into the skip connections. We introduce oriented bounding box (OBB) generation through principal component analysis (PCA) for precise tooth orientation estimation. Evaluating our approach on the publicly available DNS dataset, comprising 543 panoramic X-ray images, we achieve the highest Intersection-over-Union (IoU) score of 82.43% and Dice Similarity Coefficient (DSC) score of 90.37% among compared models in teeth instance segmentation. In OBB analysis, we obtain the Rotated IoU (RIoU) score of 82.82%. We also conduct detailed analyses of individual tooth labels and categorical performance, shedding light on strengths and weaknesses. The proposed model's accuracy and versatility offer promising prospects for improving dental diagnoses, treatment planning, and personalized healthcare in the oral domain. Our generated OBB coordinates and codes are available at https://github.com/mrinal054/Instance_teeth_segmentation.
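Oriented bounding box generation through PCA can be illustrated in two dimensions: the first principal axis of a mask's pixel coordinates gives the orientation, and the extents of the points projected onto the two principal axes give the box size. This is a generic sketch under that assumption, not the code from the linked repository; the function name `oriented_bounding_box` and the return format are illustrative.

```python
import math

def oriented_bounding_box(points):
    """Return (center, length, width, angle) of a PCA-aligned box around 2D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    # Entries of the 2x2 covariance matrix.
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    # Closed-form direction of the leading eigenvector of a 2x2 symmetric matrix.
    angle = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    ux, uy = math.cos(angle), math.sin(angle)   # major axis
    vx, vy = -uy, ux                            # minor axis
    along = [(p[0] - mx) * ux + (p[1] - my) * uy for p in points]
    across = [(p[0] - mx) * vx + (p[1] - my) * vy for p in points]
    length = max(along) - min(along)
    width = max(across) - min(across)
    # Box center is the midpoint of the extents, mapped back to image coordinates.
    ca = (max(along) + min(along)) / 2.0
    cc = (max(across) + min(across)) / 2.0
    center = (mx + ca * ux + cc * vx, my + ca * uy + cc * vy)
    return center, length, width, angle

# Points on the diagonal y = x: the box should be tilted by 45 degrees.
center, length, width, angle = oriented_bounding_box([(0, 0), (1, 1), (2, 2), (3, 3)])
```

For a segmented tooth, `points` would be the (x, y) coordinates of the mask pixels, and `angle` would serve as the orientation estimate.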
    Emergent representations in networks trained with the Forward-Forward algorithm. (arXiv:2305.18353v2 [cs.NE] UPDATED)
    The Backpropagation algorithm has often been criticised for its lack of biological realism. In an attempt to find a more biologically plausible alternative, the recently introduced Forward-Forward algorithm replaces the forward and backward passes of Backpropagation with two forward passes. In this work, we show that the internal representations obtained by the Forward-Forward algorithm can organise into category-specific ensembles exhibiting high sparsity - i.e. composed of an extremely low number of active units. This situation is reminiscent of what has been observed in cortical sensory areas, where neuronal ensembles are suggested to serve as the functional building blocks for perception and action. Interestingly, while this sparse pattern does not typically arise in models trained with standard Backpropagation, it can emerge in networks trained with Backpropagation on the same objective proposed for the Forward-Forward algorithm. These results suggest that the learning procedure proposed by Forward-Forward may be superior to Backpropagation in modelling learning in the cortex, even when a backward pass is used.
    Unconstrained Dynamic Regret via Sparse Coding. (arXiv:2301.13349v5 [cs.LG] UPDATED)
    Motivated by the challenge of nonstationarity in sequential decision making, we study Online Convex Optimization (OCO) under the coupling of two problem structures: the domain is unbounded, and the comparator sequence $u_1,\ldots,u_T$ is arbitrarily time-varying. As no algorithm can guarantee low regret simultaneously against all comparator sequences, handling this setting requires moving from minimax optimality to comparator adaptivity. That is, sensible regret bounds should depend on certain complexity measures of the comparator relative to one's prior knowledge. This paper achieves a new type of these adaptive regret bounds via a sparse coding framework. The complexity of the comparator is measured by its energy and its sparsity on a user-specified dictionary, which offers considerable versatility. Equipped with a wavelet dictionary for example, our framework improves the state-of-the-art bound (Jacobsen & Cutkosky, 2022) by adapting to both ($i$) the magnitude of the comparator average $||\bar u||=||\sum_{t=1}^Tu_t/T||$, rather than the maximum $\max_t||u_t||$; and ($ii$) the comparator variability $\sum_{t=1}^T||u_t-\bar u||$, rather than the uncentered sum $\sum_{t=1}^T||u_t||$. Furthermore, our analysis is simpler due to decoupling function approximation from regret minimization.
    An Optimal and Scalable Matrix Mechanism for Noisy Marginals under Convex Loss Functions. (arXiv:2305.08175v2 [cs.DB] UPDATED)
    Noisy marginals are a common form of confidentiality-protecting data release and are useful for many downstream tasks such as contingency table analysis, construction of Bayesian networks, and even synthetic data generation. Privacy mechanisms that provide unbiased noisy answers to linear queries (such as marginals) are known as matrix mechanisms. We propose ResidualPlanner, a matrix mechanism for marginals with Gaussian noise that is both optimal and scalable. ResidualPlanner can optimize for many loss functions that can be written as a convex function of marginal variances (prior work was restricted to just one predefined objective function). ResidualPlanner can optimize the accuracy of marginals in large scale settings in seconds, even when the previous state of the art (HDMM) runs out of memory. It even runs on datasets with 100 attributes in a couple of minutes. Furthermore ResidualPlanner can efficiently compute variance/covariance values for each marginal (prior methods quickly run out of memory, even for relatively small datasets).
    Efficient Diffusion Policies for Offline Reinforcement Learning. (arXiv:2305.20081v2 [cs.LG] UPDATED)
    Offline reinforcement learning (RL) aims to learn optimal policies from offline datasets, where the parameterization of policies is crucial but often overlooked. Recently, Diffusion-QL significantly boosted the performance of offline RL by representing a policy with a diffusion model, whose success relies on a parametrized Markov Chain with hundreds of steps for sampling. However, Diffusion-QL suffers from two critical limitations. 1) It is computationally inefficient to forward and backward through the whole Markov chain during training. 2) It is incompatible with maximum likelihood-based RL algorithms (e.g., policy gradient methods) as the likelihood of diffusion models is intractable. Therefore, we propose efficient diffusion policy (EDP) to overcome these two challenges. EDP approximately constructs actions from corrupted ones at training to avoid running the sampling chain. We conduct extensive experiments on the D4RL benchmark. The results show that EDP can reduce the diffusion policy training time from 5 days to 5 hours on gym-locomotion tasks. Moreover, we show that EDP is compatible with various offline RL algorithms (TD3, CRR, and IQL) and achieves new state-of-the-art on D4RL by large margins over previous methods. Our code is available at https://github.com/sail-sg/edp.
    Multitask Online Learning: Listen to the Neighborhood Buzz. (arXiv:2310.17385v1 [cs.LG])
    We study multitask online learning in a setting where agents can only exchange information with their neighbors on an arbitrary communication network. We introduce $\texttt{MT-CO}_2\texttt{OL}$, a decentralized algorithm for this setting whose regret depends on the interplay between the task similarities and the network structure. Our analysis shows that the regret of $\texttt{MT-CO}_2\texttt{OL}$ is never worse (up to constants) than the bound obtained when agents do not share information. On the other hand, our bounds significantly improve when neighboring agents operate on similar tasks. In addition, we prove that our algorithm can be made differentially private with a negligible impact on the regret when the losses are linear. Finally, we provide experimental support for our theory.
    Sketch-and-Project Meets Newton Method: Global $\mathcal O(k^{-2})$ Convergence with Low-Rank Updates. (arXiv:2305.13082v2 [math.OC] UPDATED)
    In this paper, we propose the first sketch-and-project Newton method with fast $\mathcal O(k^{-2})$ global convergence rate for self-concordant functions. Our method, SGN, can be viewed in three ways: i) as a sketch-and-project algorithm projecting updates of Newton method, ii) as a cubically regularized Newton method in sketched subspaces, and iii) as a damped Newton method in sketched subspaces. SGN inherits the best of all three worlds: cheap iteration costs of sketch-and-project methods, state-of-the-art $\mathcal O(k^{-2})$ global convergence rate of full-rank Newton-like methods and the algorithm simplicity of damped Newton methods. Finally, we demonstrate its comparable empirical performance to baseline algorithms.
    Curvature Filtrations for Graph Generative Model Evaluation. (arXiv:2301.12906v3 [cs.LG] UPDATED)
    Graph generative model evaluation necessitates understanding differences between graphs on the distributional level. This entails being able to harness salient attributes of graphs in an efficient manner. Curvature constitutes one such property that has recently proved its utility in characterising graphs. Its expressive properties, stability, and practical utility in model evaluation remain largely unexplored, however. We combine graph curvature descriptors with emerging methods from topological data analysis to obtain robust, expressive descriptors for evaluating graph generative models.
    Bootstrapping Vision-Language Learning with Decoupled Language Pre-training. (arXiv:2307.07063v3 [cs.CV] UPDATED)
    We present a novel methodology aimed at optimizing the application of frozen large language models (LLMs) for resource-intensive vision-language (VL) pre-training. The current paradigm uses visual features as prompts to guide language models, with a focus on determining the most relevant visual features for corresponding text. Our approach diverges by concentrating on the language component, specifically identifying the optimal prompts to align with visual features. We introduce the Prompt-Transformer (P-Former), a model that predicts these ideal prompts, which is trained exclusively on linguistic data, bypassing the need for image-text pairings. This strategy subtly bifurcates the end-to-end VL training process into an additional, separate stage. Our experiments reveal that our framework significantly enhances the performance of a robust image-to-text baseline (BLIP-2), and effectively narrows the performance gap between models trained with either 4M or 129M image-text pairs. Importantly, our framework is modality-agnostic and flexible in terms of architectural design, as validated by its successful application in a video learning task using varied base modules. The code will be made available at https://github.com/yiren-jian/BLIText.
    Counterfactual-Augmented Importance Sampling for Semi-Offline Policy Evaluation. (arXiv:2310.17146v1 [cs.LG])
    In applying reinforcement learning (RL) to high-stakes domains, quantitative and qualitative evaluation using observational data can help practitioners understand the generalization performance of new policies. However, this type of off-policy evaluation (OPE) is inherently limited since offline data may not reflect the distribution shifts resulting from the application of new policies. On the other hand, online evaluation by collecting rollouts according to the new policy is often infeasible, as deploying new policies in these domains can be unsafe. In this work, we propose a semi-offline evaluation framework as an intermediate step between offline and online evaluation, where human users provide annotations of unobserved counterfactual trajectories. While tempting to simply augment existing data with such annotations, we show that this naive approach can lead to biased results. Instead, we design a new family of OPE estimators based on importance sampling (IS) and a novel weighting scheme that incorporate counterfactual annotations without introducing additional bias. We analyze the theoretical properties of our approach, showing its potential to reduce both bias and variance compared to standard IS estimators. Our analyses reveal important practical considerations for handling biased, noisy, or missing annotations. In a series of proof-of-concept experiments involving bandits and a healthcare-inspired simulator, we demonstrate that our approach outperforms purely offline IS estimators and is robust to imperfect annotations. Our framework, combined with principled human-centered design of annotation solicitation, can enable the application of RL in high-stakes domains.
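    As a point of reference for the abstract above, the following is a minimal sketch of the standard importance sampling (IS) estimator in a bandit setting, i.e. the purely offline baseline the abstract compares against. It is not the counterfactual-augmented estimator proposed in the paper. The sample layout, the function name `is_estimate`, and the toy data are illustrative assumptions.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical layout of one logged sample:
# (context, action, reward, probability the behavior policy gave that action).
Sample = Tuple[int, int, float, float]


def is_estimate(
    logged: Sequence[Sample],
    target_prob: Callable[[int, int], float],
) -> float:
    """Standard importance-sampling estimate of the target policy's value."""
    total = 0.0
    for context, action, reward, behavior_prob in logged:
        # Reweight each logged reward by how much more (or less) likely
        # the target policy is to take the logged action.
        weight = target_prob(context, action) / behavior_prob
        total += weight * reward
    return total / len(logged)


# Toy data: the behavior policy picks each of two actions with probability 0.5.
logged = [
    (0, 0, 1.0, 0.5),
    (0, 1, 0.0, 0.5),
    (0, 0, 1.0, 0.5),
    (0, 1, 0.0, 0.5),
]


def always_zero(context: int, action: int) -> float:
    """Target policy that deterministically plays action 0."""
    return 1.0 if action == 0 else 0.0


value = is_estimate(logged, always_zero)
```

    In this toy log, action 0 always yields reward 1, so the estimate for the "always play action 0" policy should match that reward. The estimator is unbiased only when the behavior policy gives nonzero probability to every action the target policy can take.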
    Sequential Memory with Temporal Predictive Coding. (arXiv:2305.11982v2 [q-bio.NC] UPDATED)
    Forming accurate memory of sequential stimuli is a fundamental function of biological agents. However, the computational mechanism underlying sequential memory in the brain remains unclear. Inspired by neuroscience theories and recent successes in applying predictive coding (PC) to static memory tasks, in this work we propose a novel PC-based model for sequential memory, called temporal predictive coding (tPC). We show that our tPC models can memorize and retrieve sequential inputs accurately with a biologically plausible neural implementation. Importantly, our analytical study reveals that tPC can be viewed as a classical Asymmetric Hopfield Network (AHN) with an implicit statistical whitening process, which leads to more stable performance in sequential memory tasks of structured inputs. Moreover, we find that tPC exhibits properties consistent with behavioral observations and theories in neuroscience, thereby strengthening its biological relevance. Our work establishes a possible computational mechanism underlying sequential memory in the brain that can also be theoretically interpreted using existing memory model frameworks.
    Taming Gradient Variance in Federated Learning with Networked Control Variates. (arXiv:2310.17200v1 [cs.LG])
    Federated learning, a decentralized approach to machine learning, faces significant challenges such as extensive communication overheads, slow convergence, and unstable improvements. These challenges primarily stem from the gradient variance due to heterogeneous client data distributions. To address this, we introduce a novel Networked Control Variates (FedNCV) framework for Federated Learning. We adopt the REINFORCE Leave-One-Out (RLOO) as a fundamental control variate unit in the FedNCV framework, implemented at both client and server levels. At the client level, the RLOO control variate is employed to optimize local gradient updates, mitigating the variance introduced by data samples. Once relayed to the server, the RLOO-based estimator further provides an unbiased and low-variance aggregated gradient, leading to robust global updates. This dual-side application is formalized as a linear combination of composite control variates. We provide a mathematical expression capturing this integration of double control variates within FedNCV and present three theoretical results with corresponding proofs. This unique dual structure equips FedNCV to address data heterogeneity and scalability issues, thus potentially paving the way for large-scale applications. Moreover, we tested FedNCV on six diverse datasets under a Dirichlet distribution with $\alpha = 0.1$, and benchmarked its performance against six SOTA methods, demonstrating its superiority.
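    The leave-one-out idea behind RLOO can be sketched on scalars: each sample is centered by the mean of all the other samples, so a sample's baseline never depends on the sample itself. This is a generic illustration only. How FedNCV composes such control variates at the client and server levels is not specified in the abstract, and the function names below are hypothetical.

```python
from typing import List, Sequence


def leave_one_out_baselines(values: Sequence[float]) -> List[float]:
    """For each value, return the mean of all the other values."""
    n = len(values)
    if n < 2:
        raise ValueError("leave-one-out needs at least two samples")
    total = sum(values)
    return [(total - v) / (n - 1) for v in values]


def rloo_centered(values: Sequence[float]) -> List[float]:
    """Subtract each sample's leave-one-out baseline from the sample."""
    baselines = leave_one_out_baselines(values)
    return [v - b for v, b in zip(values, baselines)]


samples = [1.0, 2.0, 6.0]
baselines = leave_one_out_baselines(samples)
centered = rloo_centered(samples)
```

    Because each baseline is independent of the sample it is subtracted from, centering in this way does not bias a REINFORCE-style gradient estimate, while the centered values sum to zero.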
    On the Identifiability and Interpretability of Gaussian Process Models. (arXiv:2310.17023v1 [stat.ML])
    In this paper, we critically examine the prevalent practice of using additive mixtures of Matérn kernels in single-output Gaussian process (GP) models and explore the properties of multiplicative mixtures of Matérn kernels for multi-output GP models. For the single-output case, we derive a series of theoretical results showing that the smoothness of a mixture of Matérn kernels is determined by the least smooth component and that a GP with such a kernel is effectively equivalent to the least smooth kernel component. Furthermore, we demonstrate that none of the mixing weights or parameters within individual kernel components are identifiable. We then turn our attention to multi-output GP models and analyze the identifiability of the covariance matrix $A$ in the multiplicative kernel $K(x,y) = AK_0(x,y)$, where $K_0$ is a standard single output kernel such as Matérn. We show that $A$ is identifiable up to a multiplicative constant, suggesting that multiplicative mixtures are well suited for multi-output tasks. Our findings are supported by extensive simulations and real applications for both single- and multi-output settings. This work provides insight into kernel selection and interpretation for GP models, emphasizing the importance of choosing appropriate kernel structures for different tasks.
    Quantum Long Short-Term Memory (QLSTM) vs Classical LSTM in Time Series Forecasting: A Comparative Study in Solar Power Forecasting. (arXiv:2310.17032v1 [quant-ph])
    Accurately forecasting solar power generation is crucial in the global progression towards sustainable energy systems. In this study, we conduct a meticulous comparison between Quantum Long Short-Term Memory (QLSTM) and classical Long Short-Term Memory (LSTM) models for solar power production forecasting. Our controlled experiments reveal promising advantages of QLSTMs, including accelerated training convergence and substantially reduced test loss within the initial epoch compared to classical LSTMs. These empirical findings demonstrate QLSTM's potential to swiftly assimilate complex time series relationships, enabled by quantum phenomena like superposition. However, realizing QLSTM's full capabilities necessitates further research into model validation across diverse conditions, systematic hyperparameter optimization, hardware noise resilience, and applications to correlated renewable forecasting problems. With continued progress, quantum machine learning can offer a paradigm shift in renewable energy time series prediction. This pioneering work provides initial evidence substantiating quantum advantages over classical LSTM, while acknowledging present limitations. Through rigorous benchmarking grounded in real-world data, our study elucidates a promising trajectory for quantum learning in renewable forecasting. Additional research and development can further actualize this potential to achieve unprecedented accuracy and reliability in predicting solar power generation worldwide.
    An Explainable Deep Learning-Based Method For Schizophrenia Diagnosis Using Generative Data-Augmentation. (arXiv:2310.16867v1 [cs.LG])
    In this study, we leverage a deep learning-based method for the automatic diagnosis of schizophrenia using EEG brain recordings. This approach utilizes generative data augmentation, a powerful technique that enhances the accuracy of the diagnosis. To enable the utilization of time-frequency features, spectrograms were extracted from the raw signals. After exploring several neural network architectural setups, a proper convolutional neural network (CNN) was used for the initial diagnosis. Subsequently, using Wasserstein GAN with Gradient Penalty (WGAN-GP) and Variational Autoencoder (VAE), two different synthetic datasets were generated in order to augment the initial dataset and address the over-fitting issue. The augmented dataset using VAE achieved a 3.0% improvement in accuracy reaching up to 99.0% and yielded a lower loss value as well as a faster convergence. Finally, we addressed the lack of trust in black-box models using the Local Interpretable Model-agnostic Explanations (LIME) algorithm to determine the most important superpixels (frequencies) in the diagnosis process.
    The Significance of Machine Learning in Clinical Disease Diagnosis: A Review. (arXiv:2310.16978v1 [cs.LG])
    The global need for effective disease diagnosis remains substantial, given the complexities of various disease mechanisms and diverse patient symptoms. To tackle these challenges, researchers, physicians, and patients are turning to machine learning (ML), an artificial intelligence (AI) discipline, to develop solutions. By leveraging sophisticated ML and AI methods, healthcare stakeholders gain enhanced diagnostic and treatment capabilities. However, there is a scarcity of research focused on ML algorithms for enhancing the accuracy and computational efficiency. This research investigates the capacity of machine learning algorithms to improve the transmission of heart rate data in time series healthcare metrics, concentrating particularly on optimizing accuracy and efficiency. By exploring various ML algorithms used in healthcare applications, the review presents the latest trends and approaches in ML-based disease diagnosis (MLBDD). The factors under consideration include the algorithm utilized, the types of diseases targeted, the data types employed, the applications, and the evaluation metrics. This review aims to shed light on the prospects of ML in healthcare, particularly in disease diagnosis. By analyzing the current literature, the study provides insights into state-of-the-art methodologies and their performance metrics.
    Policy Optimization in a Noisy Neighborhood: On Return Landscapes in Continuous Control. (arXiv:2309.14597v2 [cs.LG] UPDATED)
    Deep reinforcement learning agents for continuous control are known to exhibit significant instability in their performance over time. In this work, we provide a fresh perspective on these behaviors by studying the return landscape: the mapping between a policy and a return. We find that popular algorithms traverse noisy neighborhoods of this landscape, in which a single update to the policy parameters leads to a wide range of returns. By taking a distributional view of these returns, we map the landscape, characterizing failure-prone regions of policy space and revealing a hidden dimension of policy quality. We show that the landscape exhibits surprising structure by finding simple paths in parameter space which improve the stability of a policy. To conclude, we develop a distribution-aware procedure which finds such paths, navigating away from noisy neighborhoods in order to improve the robustness of a policy. Taken together, our results provide new insight into the optimization, evaluation, and design of agents.
    Cause-Effect Inference in Location-Scale Noise Models: Maximum Likelihood vs. Independence Testing. (arXiv:2301.12930v3 [cs.LG] UPDATED)
    A fundamental problem of causal discovery is cause-effect inference, learning the correct causal direction between two random variables. Significant progress has been made through modelling the effect as a function of its cause and a noise term, which allows us to leverage assumptions about the generating function class. The recently introduced heteroscedastic location-scale noise functional models (LSNMs) combine expressive power with identifiability guarantees. LSNM model selection based on maximizing likelihood achieves state-of-the-art accuracy, when the noise distributions are correctly specified. However, through an extensive empirical evaluation, we demonstrate that the accuracy deteriorates sharply when the form of the noise distribution is misspecified by the user. Our analysis shows that the failure occurs mainly when the conditional variance in the anti-causal direction is smaller than that in the causal direction. As an alternative, we find that causal model selection through residual independence testing is much more robust to noise misspecification and misleading conditional variance.
    Local Advantage Networks for Cooperative Multi-Agent Reinforcement Learning. (arXiv:2112.12458v3 [cs.LG] UPDATED)
    Many recent successful off-policy multi-agent reinforcement learning (MARL) algorithms for cooperative partially observable environments focus on finding factorized value functions, leading to convoluted network structures. Building on the structure of independent Q-learners, our LAN algorithm takes a radically different approach, leveraging a dueling architecture to learn, for each agent, a decentralized best-response policy via individual advantage functions. The learning is stabilized by a centralized critic whose primary objective is to reduce the moving target problem of the individual advantages. The critic, whose network's size is independent of the number of agents, is cast aside after learning. Evaluation on the StarCraft II multi-agent challenge benchmark shows that LAN reaches state-of-the-art performance and is highly scalable with respect to the number of agents, opening up a promising alternative direction for MARL research.
    Oracle-Efficient Pessimism: Offline Policy Optimization in Contextual Bandits. (arXiv:2306.07923v2 [cs.LG] UPDATED)
    We consider offline policy optimization (OPO) in contextual bandits, where one is given a fixed dataset of logged interactions. While pessimistic regularizers are typically used to mitigate distribution shift, prior implementations thereof are either specialized or computationally inefficient. We present the first general oracle-efficient algorithm for pessimistic OPO: it reduces to supervised learning, leading to broad applicability. We obtain statistical guarantees analogous to those for prior pessimistic approaches. We instantiate our approach for both discrete and continuous actions and perform experiments in both settings, showing advantage over unregularized OPO across a wide range of configurations.
    BEVContrast: Self-Supervision in BEV Space for Automotive Lidar Point Clouds. (arXiv:2310.17281v1 [cs.CV])
    We present a surprisingly simple and efficient method for self-supervision of 3D backbones on automotive Lidar point clouds. We design a contrastive loss between features of Lidar scans captured in the same scene. Several such approaches have been proposed in the literature, from PointContrast, which uses a contrast at the level of points, to the state-of-the-art TARL, which uses a contrast at the level of segments, roughly corresponding to objects. While the former enjoys a great simplicity of implementation, it is surpassed by the latter, which, however, requires costly pre-processing. In BEVContrast, we define our contrast at the level of 2D cells in the Bird's Eye View plane. Resulting cell-level representations offer a good trade-off between the point-level representations exploited in PointContrast and segment-level representations exploited in TARL: we retain the simplicity of PointContrast (cell representations are cheap to compute) while surpassing the performance of TARL in downstream semantic segmentation.
    Transformer-based Atmospheric Density Forecasting. (arXiv:2310.16912v1 [physics.ao-ph])
    With the peak of the solar cycle approaching in 2025, and given the ability of a single geomagnetic storm to significantly alter the orbits of Resident Space Objects (RSOs), techniques for atmospheric density forecasting are vital for space situational awareness. While linear data-driven methods, such as dynamic mode decomposition with control (DMDc), have been used previously for forecasting atmospheric density, deep learning-based forecasting has the ability to capture nonlinearities in data. By learning multiple layer weights from historical atmospheric density data, long-term dependencies in the dataset are captured in the mapping between the current atmospheric density state and control input to the atmospheric density state at the next timestep. This work improves upon previous linear propagation methods for atmospheric density forecasting by developing a nonlinear transformer-based architecture. The empirical NRLMSISE-00 and JB2008 models, as well as the physics-based TIEGCM atmospheric density model, are compared for forecasting with DMDc and with the transformer-based propagator.  ( 2 min )
    Break it, Imitate it, Fix it: Robustness by Generating Human-Like Attacks. (arXiv:2310.16955v1 [cs.LG])
    Real-world natural language processing systems need to be robust to human adversaries. Collecting examples of human adversaries for training is an effective but expensive solution. On the other hand, training on synthetic attacks with small perturbations - such as word-substitution - does not actually improve robustness to human adversaries. In this paper, we propose an adversarial training framework that uses limited human adversarial examples to generate more useful adversarial examples at scale. We demonstrate the advantages of this system on the ANLI and hate speech detection benchmark datasets - both collected via an iterative, adversarial human-and-model-in-the-loop procedure. Compared to training only on observed human attacks, also training on our synthetic adversarial examples improves model robustness to future rounds. In ANLI, we see accuracy gains on the current set of attacks (44.1%$\,\to\,$50.1%) and on two future unseen rounds of human generated attacks (32.5%$\,\to\,$43.4%, and 29.4%$\,\to\,$40.2%). In hate speech detection, we see AUC gains on current attacks (0.76 $\to$ 0.84) and a future round (0.77 $\to$ 0.79). Attacks from methods that do not learn the distribution of existing human adversaries, meanwhile, degrade robustness.  ( 2 min )
    Probabilistic Integral Circuits. (arXiv:2310.16986v1 [cs.LG])
    Continuous latent variables (LVs) are a key ingredient of many generative models, as they allow modelling expressive mixtures with an uncountable number of components. In contrast, probabilistic circuits (PCs) are hierarchical discrete mixtures represented as computational graphs composed of input, sum and product units. Unlike continuous LV models, PCs provide tractable inference but are limited to discrete LVs with categorical (i.e. unordered) states. We bridge these model classes by introducing probabilistic integral circuits (PICs), a new language of computational graphs that extends PCs with integral units representing continuous LVs. In the first place, PICs are symbolic computational graphs and are fully tractable in simple cases where analytical integration is possible. In practice, we parameterise PICs with light-weight neural nets delivering an intractable hierarchical continuous mixture that can be approximated arbitrarily well with large PCs using numerical quadrature. On several distribution estimation benchmarks, we show that such PIC-approximating PCs systematically outperform PCs commonly learned via expectation-maximization or SGD.  ( 2 min )
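    The quadrature step mentioned above can be illustrated on a toy continuous mixture, chosen here because its exact density is known. This is a sketch under stated assumptions, not the paper's PIC construction: the latent is $z \sim N(0, 1)$, the conditional is $x \mid z \sim N(z, 1)$, so the marginal is $N(0, 2)$. The midpoint rule, the integration range, and the function names are illustrative choices.

```python
import math


def normal_pdf(x: float, mean: float, var: float) -> float:
    """Density of a univariate normal distribution."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def quadrature_mixture(x: float, num_points: int = 64,
                       lo: float = -8.0, hi: float = 8.0) -> float:
    """Approximate p(x) = integral of p(x|z) p(z) dz by a finite mixture.

    Each midpoint-rule node z_k becomes one discrete mixture component
    with weight p(z_k) * width, so the integral turns into a sum unit
    over num_points components.
    """
    width = (hi - lo) / num_points
    density = 0.0
    for k in range(num_points):
        z = lo + (k + 0.5) * width                  # quadrature node
        weight = normal_pdf(z, 0.0, 1.0) * width    # discrete mixture weight
        density += weight * normal_pdf(x, z, 1.0)   # component density p(x | z_k)
    return density


approx = quadrature_mixture(0.7)
exact = normal_pdf(0.7, 0.0, 2.0)  # closed-form marginal for this toy model
```

    Increasing `num_points` corresponds to using a larger discrete mixture. For smooth integrands such as this one, the approximation error shrinks quickly as nodes are added.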
    Improvement in Alzheimer's Disease MRI Images Analysis by Convolutional Neural Networks Via Topological Optimization. (arXiv:2310.16857v1 [eess.IV])
    This research underscores the efficacy of Fourier topological optimization in refining MRI imagery, thereby bolstering the classification precision of Alzheimer's Disease through convolutional neural networks. Recognizing that MRI scans are indispensable for neurological assessments, but frequently grapple with issues like blurriness and contrast irregularities, the deployment of Fourier topological optimization offered enhanced delineation of brain structures, ameliorated noise, and superior contrast. The applied techniques prioritized boundary enhancement, contrast and brightness adjustments, and overall image lucidity. Employing CNN architectures VGG16, ResNet50, InceptionV3, and Xception, the post-optimization analysis revealed a marked elevation in performance. Conclusively, the amalgamation of Fourier topological optimization with CNNs delineates a promising trajectory for the nuanced classification of Alzheimer's Disease, portending a transformative impact on its diagnostic paradigms.  ( 2 min )
    Squared Neural Families: A New Class of Tractable Density Models. (arXiv:2305.13552v2 [cs.LG] UPDATED)
    Flexible models for probability distributions are an essential ingredient in many machine learning tasks. We develop and investigate a new class of probability distributions, which we call a Squared Neural Family (SNEFY), formed by squaring the 2-norm of a neural network and normalising it with respect to a base measure. Following the reasoning similar to the well established connections between infinitely wide neural networks and Gaussian processes, we show that SNEFYs admit closed form normalising constants in many cases of interest, thereby resulting in flexible yet fully tractable density models. SNEFYs strictly generalise classical exponential families, are closed under conditioning, and have tractable marginal distributions. Their utility is illustrated on a variety of density estimation, conditional density estimation, and density estimation with missing data tasks.
    MimicTouch: Learning Human's Control Strategy with Multi-Modal Tactile Feedback. (arXiv:2310.16917v1 [cs.RO])
    In robotics and artificial intelligence, the integration of tactile processing is becoming increasingly pivotal, especially in learning to execute intricate tasks like alignment and insertion. However, existing works focusing on tactile methods for insertion tasks predominantly rely on robot teleoperation data and reinforcement learning, which do not utilize the rich insights provided by human's control strategy guided by tactile feedback. For utilizing human sensations, methodologies related to learning from humans predominantly leverage visual feedback, often overlooking the invaluable tactile feedback that humans inherently employ to finish complex manipulations. Addressing this gap, we introduce "MimicTouch", a novel framework that mimics human's tactile-guided control strategy. In this framework, we initially collect multi-modal tactile datasets from human demonstrators, incorporating human tactile-guided control strategies for task completion. The subsequent step involves instructing robots through imitation learning using multi-modal sensor data and retargeted human motions. To further mitigate the embodiment gap between humans and robots, we employ online residual reinforcement learning on the physical robot. Through comprehensive experiments, we validate the safety of MimicTouch in transferring a latent policy learned through imitation learning from human to robot. This ongoing work will pave the way for a broader spectrum of tactile-guided robotic applications.  ( 2 min )
    Zephyr: Direct Distillation of LM Alignment. (arXiv:2310.16944v1 [cs.LG])
    We aim to produce a smaller language model that is aligned to user intent. Previous research has shown that applying distilled supervised fine-tuning (dSFT) on larger models significantly improves task accuracy; however, these models are unaligned, i.e. they do not respond well to natural prompts. To distill this property, we experiment with the use of preference data from AI Feedback (AIF). Starting from a dataset of outputs ranked by a teacher model, we apply distilled direct preference optimization (dDPO) to learn a chat model with significantly improved intent alignment. The approach requires only a few hours of training without any additional sampling during fine-tuning. The final result, Zephyr-7B, sets the state-of-the-art on chat benchmarks for 7B parameter models, and requires no human annotation. In particular, results on MT-Bench show that Zephyr-7B surpasses Llama2-Chat-70B, the best open-access RLHF-based model. Code, models, data, and tutorials for the system are available at https://github.com/huggingface/alignment-handbook.  ( 2 min )
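    For readers unfamiliar with direct preference optimization, the per-pair loss in its standard form can be sketched as below. The abstract does not give the exact dDPO objective, so this assumes the usual DPO formulation: sequence log-probabilities under the trained policy and a frozen reference model, plus a temperature `beta`. All names and the default `beta` are illustrative.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Standard DPO loss for one (chosen, rejected) response pair.

    loss = -log(sigmoid(beta * (chosen log-ratio - rejected log-ratio))),
    where each log-ratio is log pi_policy(y|x) - log pi_ref(y|x).
    """
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_ratio - rejected_ratio)
    # Numerically stable form of -log(sigmoid(margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# When the policy equals the reference model, the margin is zero.
loss_at_init = dpo_loss(-1.0, -2.0, -1.0, -2.0)
# Raising the chosen response's log-probability relative to the reference lowers the loss.
loss_improved = dpo_loss(-0.5, -3.0, -1.0, -2.0)
```

    The loss only needs log-probabilities of responses already in the preference dataset, which is consistent with the abstract's remark that no additional sampling is required during fine-tuning.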
    Improving Few-shot Generalization of Safety Classifiers via Data Augmented Parameter-Efficient Fine-Tuning. (arXiv:2310.16959v1 [cs.LG])
    As large language models (LLMs) are widely adopted, new safety issues and policies emerge, to which existing safety classifiers do not generalize well. If we have only observed a few examples of violations of a new safety rule, how can we build a classifier to detect violations? In this paper, we study the novel setting of domain-generalized few-shot learning for LLM-based text safety classifiers. Unlike prior few-shot work, these new safety issues can be hard to uncover and we do not get to choose the few examples. We demonstrate that existing few-shot techniques do not perform well in this setting, and rather we propose to do parameter-efficient fine-tuning (PEFT) combined with augmenting training data based on similar examples in prior existing rules. We empirically show that our approach of similarity-based data-augmentation + prompt-tuning (DAPT) consistently outperforms baselines that either do not rely on data augmentation or on PEFT by 7-17% F1 score in the Social Chemistry moral judgement and 9-13% AUC in the Toxicity detection tasks, even when the new rule is loosely correlated with existing ones.  ( 2 min )
    Causal Q-Aggregation for CATE Model Selection. (arXiv:2310.16945v1 [stat.ML])
    Accurate estimation of conditional average treatment effects (CATE) is at the core of personalized decision making. While there is a plethora of models for CATE estimation, model selection is a nontrivial task, due to the fundamental problem of causal inference. Recent empirical work provides evidence in favor of proxy loss metrics with double robust properties and in favor of model ensembling. However, theoretical understanding is lacking. Direct application of prior theoretical work leads to suboptimal oracle model selection rates due to the non-convexity of the model selection problem. We provide regret rates for the major existing CATE ensembling approaches and propose a new CATE model ensembling approach based on Q-aggregation using the doubly robust loss. Our main result shows that causal Q-aggregation achieves statistically optimal oracle model selection regret rates of $\frac{\log(M)}{n}$ (with $M$ models and $n$ samples), with the addition of higher-order estimation error terms related to products of errors in the nuisance functions. Crucially, our regret rate does not require that any of the candidate CATE models be close to the truth. We validate our new method on many semi-synthetic datasets and also provide extensions of our work to CATE model selection with instrumental variables and unobserved confounding.  ( 2 min )
    Towards Continually Learning Application Performance Models. (arXiv:2310.16996v1 [cs.LG])
    Machine learning-based performance models are increasingly being used to build critical job scheduling and application optimization decisions. Traditionally, these models assume that data distribution does not change as more samples are collected over time. However, owing to the complexity and heterogeneity of production HPC systems, they are susceptible to hardware degradation, replacement, and/or software patches, which can lead to drift in the data distribution that can adversely affect the performance models. To this end, we develop continually learning performance models that account for the distribution drift, alleviate catastrophic forgetting, and improve generalizability. Our best model was able to retain accuracy, regardless of having to learn the new distribution of data inflicted by system changes, while demonstrating a 2x improvement in the prediction accuracy of the whole data sequence in comparison to the naive approach.  ( 2 min )
    Exploring Behavior Discovery Methods for Heterogeneous Swarms of Limited-Capability Robots. (arXiv:2310.16941v1 [cs.RO])
    We study the problem of determining the emergent behaviors that are possible given a functionally heterogeneous swarm of robots with limited capabilities. Prior work has considered behavior search for homogeneous swarms and proposed the use of novelty search over either a hand-specified or learned behavior space followed by clustering to return a taxonomy of emergent behaviors to the user. In this paper, we seek to better understand the role of novelty search and the efficacy of using clustering to discover novel emergent behaviors. Through a large set of experiments and ablations, we analyze the effect of representations, evolutionary search, and various clustering methods in the search for novel behaviors in a heterogeneous swarm. Our results indicate that prior methods fail to discover many interesting behaviors and that an iterative human-in-the-loop discovery process discovers more behaviors than random search, swarm chemistry, and automated behavior discovery. The combined discoveries of our experiments uncover 23 emergent behaviors, 18 of which are novel discoveries. To the best of our knowledge, these are the first known emergent behaviors for heterogeneous swarms of computation-free agents. Videos, code, and appendix are available at the project website: https://sites.google.com/view/heterogeneous-bd-methods  ( 2 min )
    General Point Model with Autoencoding and Autoregressive. (arXiv:2310.16861v1 [cs.LG])
    The pre-training architectures of large language models encompass various types, including autoencoding models, autoregressive models, and encoder-decoder models. We posit that any modality can potentially benefit from a large language model, as long as it undergoes vector quantization to become discrete tokens. Inspired by GLM, we propose a General Point Model (GPM) which seamlessly integrates autoencoding and autoregressive tasks in point cloud transformer. This model is versatile, allowing fine-tuning for downstream point cloud representation tasks, as well as unconditional and conditional generation tasks. GPM enhances masked prediction in autoencoding through various forms of mask padding tasks, leading to improved performance in point cloud understanding. Additionally, GPM demonstrates highly competitive results in unconditional point cloud generation tasks, even exhibiting the potential for conditional generation tasks by modifying the input's conditional information. Compared to models like Point-BERT, MaskPoint and PointMAE, our GPM achieves superior performance in point cloud understanding tasks. Furthermore, the integration of autoregressive and autoencoding within the same transformer underscores its versatility across different downstream tasks.  ( 2 min )
  • Open

    A Neural Collapse Perspective on Feature Evolution in Graph Neural Networks. (arXiv:2307.01951v2 [cs.LG] UPDATED)
    Graph neural networks (GNNs) have become increasingly popular for classification tasks on graph-structured data. Yet, the interplay between graph topology and feature evolution in GNNs is not well understood. In this paper, we focus on node-wise classification, illustrated with community detection on stochastic block model graphs, and explore the feature evolution through the lens of the "Neural Collapse" (NC) phenomenon. When training instance-wise deep classifiers (e.g. for image classification) beyond the zero training error point, NC demonstrates a reduction in the deepest features' within-class variability and an increased alignment of their class means to certain symmetric structures. We start with an empirical study that shows that a decrease in within-class variability is also prevalent in the node-wise classification setting, though not to the extent observed in the instance-wise case. Then, we theoretically study this distinction. Specifically, we show that even an "optimistic" mathematical model requires that the graphs obey a strict structural condition in order to possess a minimizer with exact collapse. Interestingly, this condition is also viable for heterophilic graphs and relates to recent empirical studies on settings with improved GNNs' generalization. Furthermore, by studying the gradient dynamics of the theoretical model, we provide reasoning for the partial collapse observed empirically. Finally, we present a study on the evolution of within- and between-class feature variability across layers of a well-trained GNN and contrast the behavior with spectral methods.  ( 3 min )
    Assessing the overall and partial causal well-specification of nonlinear additive noise models. (arXiv:2310.16502v2 [stat.ME] UPDATED)
    We propose a method to detect model misspecifications in nonlinear causal additive and potentially heteroscedastic noise models. We aim to identify predictor variables for which we can infer the causal effect even in cases of such misspecification. We develop a general framework based on knowledge of the multivariate observational data distribution and we then propose an algorithm for finite sample data, discuss its asymptotic properties, and illustrate its performance on simulated and real data.
    Learning Rate Free Bayesian Inference in Constrained Domains. (arXiv:2305.14943v2 [stat.ML] UPDATED)
    We introduce a suite of new particle-based algorithms for sampling on constrained domains which are entirely learning rate free. Our approach leverages coin betting ideas from convex optimisation, and the viewpoint of constrained sampling as a mirrored optimisation problem on the space of probability measures. Based on this viewpoint, we also introduce a unifying framework for several existing constrained sampling algorithms, including mirrored Langevin dynamics and mirrored Stein variational gradient descent. We demonstrate the performance of our algorithms on a range of numerical examples, including sampling from targets on the simplex, sampling with fairness constraints, and constrained sampling problems in post-selection inference. Our results indicate that our algorithms achieve competitive performance with existing constrained sampling methods, without the need to tune any hyperparameters.
    Multi-scale Diffusion Denoised Smoothing. (arXiv:2310.16779v2 [cs.LG] UPDATED)
    Along with recent diffusion models, randomized smoothing has become one of a few tangible approaches that offer adversarial robustness to models at scale, e.g., large pre-trained models. Specifically, one can perform randomized smoothing on any classifier via a simple "denoise-and-classify" pipeline, so-called denoised smoothing, given that an accurate denoiser is available, such as a diffusion model. In this paper, we present scalable methods to address the current trade-off between certified robustness and accuracy in denoised smoothing. Our key idea is to "selectively" apply smoothing among multiple noise scales, coined multi-scale smoothing, which can be efficiently implemented with a single diffusion model. This approach also suggests a new objective to compare the collective robustness of multi-scale smoothed classifiers, and questions which representation of the diffusion model would maximize the objective. To address this, we propose to further fine-tune the diffusion model (a) to perform consistent denoising whenever the original image is recoverable, but (b) to generate rather diverse outputs otherwise. Our experiments show that the proposed multi-scale smoothing scheme combined with diffusion fine-tuning enables strong certified robustness at high noise levels while maintaining accuracy closer to that of non-smoothed classifiers.  ( 2 min )
    Robust Output Analysis with Monte-Carlo Methodology. (arXiv:2207.13612v3 [stat.ME] CROSS LISTED)
    In predictive modeling with simulation or machine learning, it is critical to accurately assess the quality of estimated values through output analysis. In recent decades output analysis has become enriched with methods that quantify the impact of input data uncertainty in the model outputs to increase robustness. However, most developments are applicable assuming that the input data adheres to a parametric family of distributions. We propose a unified output analysis framework for simulation and machine learning outputs through the lens of Monte Carlo sampling. This framework provides nonparametric quantification of the variance and bias induced in the outputs with higher-order accuracy. Our new bias-corrected estimation from the model outputs leverages the extension of fast iterative bootstrap sampling and higher-order influence functions. For the scalability of the proposed estimation methods, we devise budget-optimal rules and leverage control variates for variance reduction. Our theoretical and numerical results demonstrate a clear advantage in building more robust confidence intervals from the model outputs with higher coverage probability.  ( 2 min )
    A Mean Field Approach to Empirical Bayes Estimation in High-dimensional Linear Regression. (arXiv:2309.16843v2 [math.ST] UPDATED)
    We study empirical Bayes estimation in high-dimensional linear regression. To facilitate computationally efficient estimation of the underlying prior, we adopt a variational empirical Bayes approach, introduced originally in Carbonetto and Stephens (2012) and Kim et al. (2022). We establish asymptotic consistency of the nonparametric maximum likelihood estimator (NPMLE) and its (computable) naive mean field variational surrogate under mild assumptions on the design and the prior. Assuming, in addition, that the naive mean field approximation has a dominant optimizer, we develop a computationally efficient approximation to the oracle posterior distribution, and establish its accuracy under the 1-Wasserstein metric. This enables computationally feasible Bayesian inference; e.g., construction of posterior credible intervals with an average coverage guarantee, Bayes optimal estimation for the regression coefficients, estimation of the proportion of non-nulls, etc. Our analysis covers both deterministic and random designs, and accommodates correlations among the features. To the best of our knowledge, this provides the first rigorous nonparametric empirical Bayes method in a high-dimensional regression setting without sparsity.  ( 2 min )
    Statistically Valid Variable Importance Assessment through Conditional Permutations. (arXiv:2309.07593v2 [cs.LG] UPDATED)
    Variable importance assessment has become a crucial step in machine-learning applications when using complex learners, such as deep neural networks, on large-scale data. Removal-based importance assessment is currently the reference approach, particularly when statistical guarantees are sought to justify variable inclusion. It is often implemented with variable permutation schemes. On the flip side, these approaches risk misidentifying unimportant variables as important in the presence of correlations among covariates. Here we develop a systematic approach for studying Conditional Permutation Importance (CPI) that is model agnostic and computationally lean, as well as reusable benchmarks of state-of-the-art variable importance estimators. We show theoretically and empirically that $\textit{CPI}$ overcomes the limitations of standard permutation importance by providing accurate type-I error control. When used with a deep neural network, $\textit{CPI}$ consistently showed top accuracy across benchmarks. An experiment on real-world data analysis in a large-scale medical dataset showed that $\textit{CPI}$ provides a more parsimonious selection of statistically significant variables. Our results suggest that $\textit{CPI}$ can be readily used as drop-in replacement for permutation-based methods.  ( 3 min )
    Unconstrained Dynamic Regret via Sparse Coding. (arXiv:2301.13349v5 [cs.LG] UPDATED)
    Motivated by the challenge of nonstationarity in sequential decision making, we study Online Convex Optimization (OCO) under the coupling of two problem structures: the domain is unbounded, and the comparator sequence $u_1,\ldots,u_T$ is arbitrarily time-varying. As no algorithm can guarantee low regret simultaneously against all comparator sequences, handling this setting requires moving from minimax optimality to comparator adaptivity. That is, sensible regret bounds should depend on certain complexity measures of the comparator relative to one's prior knowledge. This paper achieves a new type of these adaptive regret bounds via a sparse coding framework. The complexity of the comparator is measured by its energy and its sparsity on a user-specified dictionary, which offers considerable versatility. Equipped with a wavelet dictionary for example, our framework improves the state-of-the-art bound (Jacobsen & Cutkosky, 2022) by adapting to both ($i$) the magnitude of the comparator average $||\bar u||=||\sum_{t=1}^Tu_t/T||$, rather than the maximum $\max_t||u_t||$; and ($ii$) the comparator variability $\sum_{t=1}^T||u_t-\bar u||$, rather than the uncentered sum $\sum_{t=1}^T||u_t||$. Furthermore, our analysis is simpler due to decoupling function approximation from regret minimization.  ( 3 min )
    Robust Covariate Shift Adaptation for Density-Ratio Estimation. (arXiv:2310.16638v2 [stat.ME] UPDATED)
    Consider a scenario where we have access to train data with both covariates and outcomes while test data only contains covariates. In this scenario, our primary aim is to predict the missing outcomes of the test data. With this objective in mind, we train parametric regression models under a covariate shift, where covariate distributions are different between the train and test data. For this problem, existing studies have proposed covariate shift adaptation via importance weighting using the density ratio. This approach averages the train data losses, each weighted by an estimated ratio of the covariate densities between the train and test data, to approximate the test-data risk. Although it allows us to obtain a test-data risk minimizer, its performance heavily relies on the accuracy of the density ratio estimation. Moreover, even if the density ratio can be consistently estimated, the estimation errors of the density ratio also yield bias in the estimators of the regression model's parameters of interest. To mitigate these challenges, we introduce a doubly robust estimator for covariate shift adaptation via importance weighting, which incorporates an additional estimator for the regression function. Leveraging double machine learning techniques, our estimator reduces the bias arising from the density ratio estimation errors. We demonstrate the asymptotic distribution of the regression parameter estimator. Notably, our estimator remains consistent if either the density ratio estimator or the regression function is consistent, showcasing its robustness against potential errors in density ratio estimation. Finally, we confirm the soundness of our proposed method via simulation studies.
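    The importance-weighting step described in the abstract above can be illustrated with a minimal sketch. It assumes the density ratio is taken as test density over train density and is known exactly; the function name and the toy two-point covariate distribution are hypothetical, and the doubly robust correction proposed in the paper is not implemented here.

```python
def importance_weighted_risk(train_losses, density_ratios):
    """Average the train losses, each weighted by r(x) = p_test(x) / p_train(x).

    With an exact density ratio this approximates the test-data risk.
    Both arguments are parallel lists, one entry per training example.
    """
    if len(train_losses) != len(density_ratios):
        raise ValueError("need one density ratio per training loss")
    weighted = (r * l for r, l in zip(density_ratios, train_losses))
    return sum(weighted) / len(train_losses)


# Toy setup (assumed for illustration): covariate x takes values 0 or 1.
# p_train = (0.8, 0.2), p_test = (0.2, 0.8), loss(x=0) = 1, loss(x=1) = 3.
ratio = {0: 0.2 / 0.8, 1: 0.8 / 0.2}
loss = {0: 1.0, 1: 3.0}

train_x = [0] * 8 + [1] * 2  # a sample that matches p_train exactly
risk = importance_weighted_risk(
    [loss[x] for x in train_x],
    [ratio[x] for x in train_x],
)
true_test_risk = 0.2 * loss[0] + 0.8 * loss[1]
```

    In this toy case the weighted train average coincides with the test risk; with an estimated ratio, the estimation error carries over into the risk estimate, which is the sensitivity the abstract points to.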
    Cause-Effect Inference in Location-Scale Noise Models: Maximum Likelihood vs. Independence Testing. (arXiv:2301.12930v3 [cs.LG] UPDATED)
    A fundamental problem of causal discovery is cause-effect inference, learning the correct causal direction between two random variables. Significant progress has been made through modelling the effect as a function of its cause and a noise term, which allows us to leverage assumptions about the generating function class. The recently introduced heteroscedastic location-scale noise functional models (LSNMs) combine expressive power with identifiability guarantees. LSNM model selection based on maximizing likelihood achieves state-of-the-art accuracy, when the noise distributions are correctly specified. However, through an extensive empirical evaluation, we demonstrate that the accuracy deteriorates sharply when the form of the noise distribution is misspecified by the user. Our analysis shows that the failure occurs mainly when the conditional variance in the anti-causal direction is smaller than that in the causal direction. As an alternative, we find that causal model selection through residual independence testing is much more robust to noise misspecification and misleading conditional variance.  ( 2 min )
    Trajectory Alignment: Understanding the Edge of Stability Phenomenon via Bifurcation Theory. (arXiv:2307.04204v2 [cs.LG] UPDATED)
    Cohen et al. (2021) empirically study the evolution of the largest eigenvalue of the loss Hessian, also known as sharpness, along the gradient descent (GD) trajectory and observe the Edge of Stability (EoS) phenomenon. The sharpness increases at the early phase of training (referred to as progressive sharpening), and eventually saturates close to the threshold of $2 / \text{(step size)}$. In this paper, we start by demonstrating through empirical studies that when the EoS phenomenon occurs, different GD trajectories (after a proper reparameterization) align on a specific bifurcation diagram independent of initialization. We then rigorously prove this trajectory alignment phenomenon for a two-layer fully-connected linear network and a single-neuron nonlinear network trained with a single data point. Our trajectory alignment analysis establishes both progressive sharpening and EoS phenomena, encompassing and extending recent findings in the literature.  ( 2 min )
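    The threshold $2 / \text{(step size)}$ mentioned above is the classical stability limit of gradient descent on a quadratic. The sketch below is a textbook illustration only, not the paper's bifurcation analysis; the function name and the chosen numbers are assumptions.

```python
def gd_on_quadratic(sharpness, step_size, x0=1.0, steps=50):
    """Run gradient descent on f(x) = 0.5 * sharpness * x**2.

    The gradient is sharpness * x, so each step multiplies x by
    (1 - step_size * sharpness). The iterates shrink exactly when
    sharpness < 2 / step_size. Returns |x| after the final step.
    """
    x = x0
    for _ in range(steps):
        x -= step_size * sharpness * x
    return abs(x)


step_size = 0.1                     # stability threshold is 2 / 0.1 = 20
below = gd_on_quadratic(19.0, step_size)  # sharpness just under the threshold
above = gd_on_quadratic(21.0, step_size)  # sharpness just over the threshold
```

    Below the threshold the iterate contracts by a factor of 0.9 per step; above it, the iterate grows by a factor of 1.1 per step while flipping sign.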
    Statistical Component Separation for Targeted Signal Recovery in Noisy Mixtures. (arXiv:2306.15012v2 [stat.ML] UPDATED)
    Separating signals from an additive mixture may be an unnecessarily hard problem when one is only interested in specific properties of a given signal. In this work, we tackle simpler "statistical component separation" problems that focus on recovering a predefined set of statistical descriptors of a target signal from a noisy mixture. Assuming access to samples of the noise process, we investigate a method devised to match the statistics of the solution candidate corrupted by noise samples with those of the observed mixture. We first analyze the behavior of this method using simple examples with analytically tractable calculations. Then, we apply it in an image denoising context employing 1) wavelet-based descriptors and 2) ConvNet-based descriptors on astrophysics and ImageNet data. In the case of 1), we show that our method better recovers the descriptors of the target data than a standard denoising method in most situations. Additionally, despite not being constructed for this purpose, it performs surprisingly well in terms of peak signal-to-noise ratio on full signal reconstruction. In comparison, representation 2) appears less suitable for image denoising. Finally, we extend this method by introducing a diffusive stepwise algorithm which gives a new perspective to the initial method and leads to promising results for image denoising under specific circumstances.  ( 3 min )
    Mind the spikes: Benign overfitting of kernels and neural networks in fixed dimension. (arXiv:2305.14077v2 [stat.ML] UPDATED)
    The success of over-parameterized neural networks trained to near-zero training error has caused great interest in the phenomenon of benign overfitting, where estimators are statistically consistent even though they interpolate noisy training data. While benign overfitting in fixed dimension has been established for some learning methods, current literature suggests that for regression with typical kernel methods and wide neural networks, benign overfitting requires a high-dimensional setting where the dimension grows with the sample size. In this paper, we show that the smoothness of the estimators, and not the dimension, is the key: benign overfitting is possible if and only if the estimator's derivatives are large enough. We generalize existing inconsistency results to non-interpolating models and more kernels to show that benign overfitting with moderate derivatives is impossible in fixed dimension. Conversely, we show that rate-optimal benign overfitting is possible for regression with a sequence of spiky-smooth kernels with large derivatives. Using neural tangent kernels, we translate our results to wide neural networks. We prove that while infinite-width networks do not overfit benignly with the ReLU activation, this can be fixed by adding small high-frequency fluctuations to the activation function. Our experiments verify that such neural networks, while overfitting, can indeed generalize well even on low-dimensional data sets.  ( 3 min )
    Label Embedding via Low-Coherence Matrices. (arXiv:2305.19470v3 [cs.LG] UPDATED)
    Label embedding is a framework for multiclass classification problems where each label is represented by a distinct vector of some fixed dimension, and training involves matching model output to the vector representing the correct label. While label embedding has been successfully applied in extreme classification and zero-shot learning, and offers both computational and statistical advantages, its theoretical foundations remain poorly understood. This work presents an analysis of label embedding in the context of extreme multiclass classification, where the number of classes $C$ is very large. We present an excess risk bound that reveals a trade-off between computational and statistical efficiency, quantified via the coherence of the embedding matrix. We further show that under the Massart noise condition, the statistical penalty for label embedding vanishes with sufficiently low coherence. Our analysis supports an algorithm that is simple, scalable, and easily parallelizable, and experimental results demonstrate its effectiveness in large-scale applications.  ( 2 min )
    Distribution-Free Model-Agnostic Regression Calibration via Nonparametric Methods. (arXiv:2305.12283v2 [cs.LG] UPDATED)
    In this paper, we consider the uncertainty quantification problem for regression models. Specifically, we consider an individual calibration objective for characterizing the quantiles of the prediction model. While such an objective is well-motivated from downstream tasks such as newsvendor cost, the existing methods have been largely heuristic and lack statistical guarantees in terms of individual calibration. We show via simple examples that the existing methods focusing on population-level calibration guarantees such as average calibration or sharpness can lead to harmful and unexpected results. We propose simple nonparametric calibration methods that are agnostic of the underlying prediction model and enjoy both computational efficiency and statistical consistency. Our approach enables a better understanding of the possibility of individual calibration, and we establish matching upper and lower bounds for the calibration error of our proposed methods. Technically, our analysis combines the nonparametric analysis with a covering number argument for parametric analysis, which advances the existing theoretical analyses in the literature of nonparametric density estimation and quantile bandit problems. Importantly, the nonparametric perspective sheds new theoretical insights into regression calibration in terms of the curse of dimensionality and reconciles the existing results on the impossibility of individual calibration. To our knowledge, we make the first effort to reach both individual calibration and finite-sample guarantee with minimal assumptions in terms of conformal prediction. Numerical experiments show the advantage of such a simple approach under various metrics, and also under covariate shift. We hope our work provides a simple benchmark and a starting point of theoretical ground for future research on regression calibration.  ( 3 min )
    No-Regret Online Reinforcement Learning with Adversarial Losses and Transitions. (arXiv:2305.17380v3 [cs.LG] UPDATED)
    Existing online learning algorithms for adversarial Markov Decision Processes achieve ${O}(\sqrt{T})$ regret after $T$ rounds of interactions even if the loss functions are chosen arbitrarily by an adversary, with the caveat that the transition function has to be fixed. This is because it has been shown that adversarial transition functions make no-regret learning impossible. Despite such impossibility results, in this work, we develop algorithms that can handle both adversarial losses and adversarial transitions, with regret increasing smoothly in the degree of maliciousness of the adversary. More concretely, we first propose an algorithm that enjoys $\widetilde{{O}}(\sqrt{T} + C^{\textsf{P}})$ regret where $C^{\textsf{P}}$ measures how adversarial the transition functions are and can be at most ${O}(T)$. While this algorithm itself requires knowledge of $C^{\textsf{P}}$, we further develop a black-box reduction approach that removes this requirement. Moreover, we also show that further refinements of the algorithm not only maintain the same regret bound, but also simultaneously adapt to easier environments (where losses are generated in a certain stochastically constrained manner as in Jin et al. [2021]) and achieve $\widetilde{{O}}(U + \sqrt{UC^{\textsf{L}}} + C^{\textsf{P}})$ regret, where $U$ is some standard gap-dependent coefficient and $C^{\textsf{L}}$ is the amount of corruption on losses.  ( 3 min )
    Improved Best-of-Both-Worlds Guarantees for Multi-Armed Bandits: FTRL with General Regularizers and Multiple Optimal Arms. (arXiv:2302.13534v2 [cs.LG] UPDATED)
    We study the problem of designing adaptive multi-armed bandit algorithms that perform optimally in both the stochastic setting and the adversarial setting simultaneously (often known as a best-of-both-worlds guarantee). A line of recent works shows that when configured and analyzed properly, the Follow-the-Regularized-Leader (FTRL) algorithm, originally designed for the adversarial setting, can in fact optimally adapt to the stochastic setting as well. Such results, however, critically rely on an assumption that there exists one unique optimal arm. Recently, Ito (2021) took the first step to remove such an undesirable uniqueness assumption for one particular FTRL algorithm with the $\frac{1}{2}$-Tsallis entropy regularizer. In this work, we significantly improve and generalize this result, showing that uniqueness is unnecessary for FTRL with a broad family of regularizers and a new learning rate schedule. For some regularizers, our regret bounds also improve upon prior results even when uniqueness holds. We further provide an application of our results to the decoupled exploration and exploitation problem, demonstrating that our techniques are broadly applicable.  ( 3 min )
    Monte Carlo guided Diffusion for Bayesian linear inverse problems. (arXiv:2308.07983v2 [stat.ML] UPDATED)
    Ill-posed linear inverse problems arise frequently in various applications, from computational photography to medical imaging. A recent line of research exploits Bayesian inference with informative priors to handle the ill-posedness of such problems. Amongst such priors, score-based generative models (SGM) have recently been successfully applied to several different inverse problems. In this study, we exploit the particular structure of the prior defined by the SGM to define a sequence of intermediate linear inverse problems. As the noise level decreases, the posteriors of these inverse problems get closer to the target posterior of the original inverse problem. To sample from this sequence of posteriors, we propose the use of Sequential Monte Carlo (SMC) methods. The proposed algorithm, MCGDiff, is shown to be theoretically grounded and we provide numerical simulations showing that it outperforms competing baselines when dealing with ill-posed inverse problems in a Bayesian setting.  ( 2 min )
    Hierarchical clustering with OWA-based linkages, the Lance-Williams formula, and dendrogram inversions. (arXiv:2303.05683v2 [stat.ML] UPDATED)
    Agglomerative hierarchical clustering based on Ordered Weighted Averaging (OWA) operators not only generalises the single, complete, and average linkages, but also includes intercluster distances based on a few nearest or farthest neighbours, trimmed and winsorised means of pairwise point similarities, amongst many others. We explore the relationships between the famous Lance-Williams update formula and the extended OWA-based linkages with weights generated via infinite coefficient sequences. Furthermore, we provide some conditions for the weight generators to guarantee the resulting dendrograms to be free from unaesthetic inversions.  ( 2 min )
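    The claim that OWA-based linkages generalise the single, complete, and average linkages can be checked on a small example. The sketch below assumes the usual OWA convention (weights are applied to the values sorted in decreasing order); the function names and the toy clusters are hypothetical, and the Lance-Williams update itself is not shown.

```python
def owa_linkage(cluster_a, cluster_b, weights, dist):
    """Intercluster distance as an OWA of all pairwise distances.

    The pairwise distances are sorted in decreasing order, then combined
    with the given weights (one weight per pair, summing to 1).
    """
    d = sorted((dist(a, b) for a in cluster_a for b in cluster_b), reverse=True)
    if len(weights) != len(d):
        raise ValueError("need one weight per pairwise distance")
    return sum(w * x for w, x in zip(weights, d))


def complete_weights(n):
    return [1.0] + [0.0] * (n - 1)   # all weight on the largest distance


def single_weights(n):
    return [0.0] * (n - 1) + [1.0]   # all weight on the smallest distance


def average_weights(n):
    return [1.0 / n] * n             # equal weight on every distance


# Toy 1-D clusters with absolute difference as the distance.
A, B = [0.0, 1.0], [3.0, 5.0]
absdiff = lambda a, b: abs(a - b)
n_pairs = len(A) * len(B)            # pairwise distances: 5, 4, 3, 2

d_single = owa_linkage(A, B, single_weights(n_pairs), absdiff)
d_complete = owa_linkage(A, B, complete_weights(n_pairs), absdiff)
d_average = owa_linkage(A, B, average_weights(n_pairs), absdiff)
```

    Other weight vectors give the remaining linkages named in the abstract, for example weight spread over the few smallest entries for a nearest-neighbours linkage.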
    RS-Del: Edit Distance Robustness Certificates for Sequence Classifiers via Randomized Deletion. (arXiv:2302.01757v2 [cs.CR] UPDATED)
    Randomized smoothing is a leading approach for constructing classifiers that are certifiably robust against adversarial examples. Existing work on randomized smoothing has focused on classifiers with continuous inputs, such as images, where $\ell_p$-norm bounded adversaries are commonly studied. However, there has been limited work for classifiers with discrete or variable-size inputs, such as for source code, which require different threat models and smoothing mechanisms. In this work, we adapt randomized smoothing for discrete sequence classifiers to provide certified robustness against edit distance-bounded adversaries. Our proposed smoothing mechanism randomized deletion (RS-Del) applies random deletion edits, which are (perhaps surprisingly) sufficient to confer robustness against adversarial deletion, insertion and substitution edits. Our proof of certification deviates from the established Neyman-Pearson approach, which is intractable in our setting, and is instead organized around longest common subsequences. We present a case study on malware detection--a binary classification problem on byte sequences where classifier evasion is a well-established threat model. When applied to the popular MalConv malware detection model, our smoothing mechanism RS-Del achieves a certified accuracy of 91% at an edit distance radius of 128 bytes.  ( 3 min )
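    The smoothing mechanism described above (random deletion edits followed by classification) can be sketched as a majority vote over randomly shortened copies of the input. This is an illustration only: the deletion probability, the sample count, the function names, and the toy byte-pattern classifier are all assumptions, and the edit-distance certificate derived in the paper is not computed.

```python
import random
from collections import Counter


def random_deletion(seq, p_del, rng):
    """Delete each element independently with probability p_del."""
    return [tok for tok in seq if rng.random() >= p_del]


def smoothed_predict(classifier, seq, p_del=0.5, n_samples=200, seed=0):
    """Majority vote of the base classifier over randomly deleted copies."""
    rng = random.Random(seed)
    votes = Counter(
        classifier(random_deletion(seq, p_del, rng)) for _ in range(n_samples)
    )
    return votes.most_common(1)[0][0]


def toy_classifier(byte_seq):
    """Hypothetical base classifier: flags any sequence containing 0xFF."""
    return "malicious" if 0xFF in byte_seq else "benign"


clean = [0x01, 0x02, 0x03, 0x04]
label = smoothed_predict(toy_classifier, clean)
```

    Because deletion only produces subsequences of the input, a clean input here can never acquire the flagged byte, so the vote is unanimous for this toy classifier.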
    Mutual information of spin systems from autoregressive neural networks. (arXiv:2304.13412v2 [cond-mat.stat-mech] UPDATED)
    We describe a new direct method to estimate bipartite mutual information of a classical spin system based on Monte Carlo sampling enhanced by autoregressive neural networks. It allows studying arbitrary geometries of subsystems and can be generalized to classical field theories. We demonstrate it on the Ising model for four partitionings, including a multiply-connected even-odd division. We show that the area law is satisfied for temperatures away from the critical temperature: the constant term is universal, whereas the proportionality coefficient is different for the even-odd partitioning.  ( 2 min )
    Gaussian Membership Inference Privacy. (arXiv:2306.07273v2 [cs.LG] UPDATED)
    We propose a novel and practical privacy notion called $f$-Membership Inference Privacy ($f$-MIP), which explicitly considers the capabilities of realistic adversaries under the membership inference attack threat model. Consequently, $f$-MIP offers interpretable privacy guarantees and improved utility (e.g., better classification accuracy). In particular, we derive a parametric family of $f$-MIP guarantees that we refer to as $\mu$-Gaussian Membership Inference Privacy ($\mu$-GMIP) by theoretically analyzing likelihood ratio-based membership inference attacks on stochastic gradient descent (SGD). Our analysis highlights that models trained with standard SGD already offer an elementary level of MIP. Additionally, we show how $f$-MIP can be amplified by adding noise to gradient updates. Our analysis further yields an analytical membership inference attack that offers two distinct advantages over previous approaches. First, unlike existing state-of-the-art attacks that require training hundreds of shadow models, our attack does not require any shadow model. Second, our analytical attack enables straightforward auditing of our privacy notion $f$-MIP. Finally, we quantify how various hyperparameters (e.g., batch size, number of model parameters) and specific data characteristics determine an attacker's ability to accurately infer a point's membership in the training set. We demonstrate the effectiveness of our method on models trained on vision and tabular datasets.  ( 2 min )
    A Batch-to-Online Transformation under Random-Order Model. (arXiv:2306.07163v2 [cs.LG] UPDATED)
    We introduce a transformation framework that can be utilized to develop online algorithms with low $\epsilon$-approximate regret in the random-order model from offline approximation algorithms. We first give a general reduction theorem that transforms an offline approximation algorithm with low average sensitivity to an online algorithm with low $\epsilon$-approximate regret. We then demonstrate that offline approximation algorithms can be transformed into a low-sensitivity version using a coreset construction method. To showcase the versatility of our approach, we apply it to various problems, including online $(k,z)$-clustering, online matrix approximation, and online regression, and successfully achieve polylogarithmic $\epsilon$-approximate regret for each problem. Moreover, we show that in all three cases, our algorithm also enjoys low inconsistency, which may be desired in some online applications.  ( 2 min )
    Instability of computer vision models is a necessary result of the task itself. (arXiv:2310.17559v1 [cs.CV])
    Adversarial examples resulting from instability of current computer vision models are an extremely important topic due to their potential to compromise any application. In this paper we demonstrate that instability is inevitable due to a) symmetries (translational invariance) of the data, b) the categorical nature of the classification task, and c) the fundamental discrepancy of classifying images as objects themselves. The issue is further exacerbated by non-exhaustive labelling of the training data. Therefore we conclude that instability is a necessary result of how the problem of computer vision is currently formulated. While the problem cannot be eliminated, through the analysis of the causes, we have arrived at ways in which it can be partially alleviated. These include i) increasing the resolution of images, ii) providing contextual information for the image, iii) exhaustive labelling of training data, and iv) preventing attackers from frequent access to the computer vision system.  ( 2 min )
    Convergence of flow-based generative models via proximal gradient descent in Wasserstein space. (arXiv:2310.17582v1 [stat.ML])
    Flow-based generative models enjoy certain advantages in computing the data generation and the likelihood, and have recently shown competitive empirical performance. Compared to the accumulating theoretical studies on related score-based diffusion models, analysis of flow-based models, which are deterministic in both forward (data-to-noise) and reverse (noise-to-data) directions, remains sparse. In this paper, we provide a theoretical guarantee of generating data distribution by a progressive flow model, the so-called JKO flow model, which implements the Jordan-Kinderlehrer-Otto (JKO) scheme in a normalizing flow network. Leveraging the exponential convergence of the proximal gradient descent (GD) in Wasserstein space, we prove the Kullback-Leibler (KL) guarantee of data generation by a JKO flow model to be $O(\varepsilon^2)$ when using $N \lesssim \log (1/\varepsilon)$ many JKO steps ($N$ Residual Blocks in the flow) where $\varepsilon$ is the error in the per-step first-order condition. The assumption on data density is merely a finite second moment, and the theory extends to data distributions without density and when there are inversion errors in the reverse process where we obtain KL-$W_2$ mixed error guarantees. The non-asymptotic convergence rate of the JKO-type $W_2$-proximal GD is proved for a general class of convex objective functionals that includes the KL divergence as a special case, which can be of independent interest.  ( 3 min )
    Bounding Box-based Multi-objective Bayesian Optimization of Risk Measures under Input Uncertainty. (arXiv:2301.11588v2 [stat.ML] UPDATED)
    In this study, we propose a novel multi-objective Bayesian optimization (MOBO) method to efficiently identify the Pareto front (PF) defined by risk measures for black-box functions in the presence of input uncertainty (IU). Existing BO methods for Pareto optimization in the presence of IU are risk-specific or without theoretical guarantees, whereas our proposed method addresses general risk measures and has theoretical guarantees. The basic idea of the proposed method is to assume a Gaussian process (GP) model for the black-box function and to construct high-probability bounding boxes for the risk measures using the GP model. Furthermore, in order to reduce the uncertainty of non-dominated bounding boxes, we propose a method of selecting the next evaluation point using a maximin distance defined by the maximum value of a quasi distance based on bounding boxes. As theoretical analysis, we prove that the algorithm can return an arbitrarily accurate solution in a finite number of iterations with high probability, for various risk measures such as Bayes risk, worst-case risk, and value-at-risk. We also give a theoretical analysis that takes into account approximation errors because there exist non-negligible approximation errors (e.g., finite approximation of PFs and sampling-based approximation of bounding boxes) in practice. We confirm through numerical experiments that the proposed method outperforms existing methods not only in the setting with IU but also in the setting of ordinary MOBO.  ( 3 min )
    Curvature Filtrations for Graph Generative Model Evaluation. (arXiv:2301.12906v3 [cs.LG] UPDATED)
    Graph generative model evaluation necessitates understanding differences between graphs on the distributional level. This entails being able to harness salient attributes of graphs in an efficient manner. Curvature constitutes one such property that has recently proved its utility in characterising graphs. Its expressive properties, stability, and practical utility in model evaluation remain largely unexplored, however. We combine graph curvature descriptors with emerging methods from topological data analysis to obtain robust, expressive descriptors for evaluating graph generative models.  ( 2 min )
    The statistical thermodynamics of generative diffusion models. (arXiv:2310.17467v1 [stat.ML])
    Generative diffusion models have achieved spectacular performance in many areas of generative modeling. While the fundamental ideas behind these models come from non-equilibrium physics, in this paper we show that many aspects of these models can be understood using the tools of equilibrium statistical mechanics. Using this reformulation, we show that generative diffusion models undergo second-order phase transitions corresponding to symmetry breaking phenomena. We argue that this leads to a form of instability that lies at the heart of their generative capabilities and that can be described by a set of mean field critical exponents. We conclude by analyzing recent work connecting diffusion models and associative memory networks in view of the thermodynamic formulations.  ( 2 min )
    Sequential Memory with Temporal Predictive Coding. (arXiv:2305.11982v2 [q-bio.NC] UPDATED)
    Forming accurate memory of sequential stimuli is a fundamental function of biological agents. However, the computational mechanism underlying sequential memory in the brain remains unclear. Inspired by neuroscience theories and recent successes in applying predictive coding (PC) to \emph{static} memory tasks, in this work we propose a novel PC-based model for \emph{sequential} memory, called \emph{temporal predictive coding} (tPC). We show that our tPC models can memorize and retrieve sequential inputs accurately with a biologically plausible neural implementation. Importantly, our analytical study reveals that tPC can be viewed as a classical Asymmetric Hopfield Network (AHN) with an implicit statistical whitening process, which leads to more stable performance in sequential memory tasks of structured inputs. Moreover, we find that tPC exhibits properties consistent with behavioral observations and theories in neuroscience, thereby strengthening its biological relevance. Our work establishes a possible computational mechanism underlying sequential memory in the brain that can also be theoretically interpreted using existing memory model frameworks.  ( 2 min )
    Small Total-Cost Constraints in Contextual Bandits with Knapsacks, with Application to Fairness. (arXiv:2305.15807v2 [stat.ML] UPDATED)
    We consider contextual bandit problems with knapsacks [CBwK], a problem where at each round, a scalar reward is obtained and vector-valued costs are suffered. The learner aims to maximize the cumulative rewards while ensuring that the cumulative costs are lower than some predetermined cost constraints. We assume that contexts come from a continuous set, that costs can be signed, and that the expected reward and cost functions, while unknown, may be uniformly estimated -- a typical assumption in the literature. In this setting, total cost constraints had so far to be at least of order $T^{3/4}$, where $T$ is the number of rounds, and were even typically assumed to depend linearly on $T$. We are however motivated to use CBwK to impose a fairness constraint of equalized average costs between groups: the budget associated with the corresponding cost constraints should be as close as possible to the natural deviations, of order $\sqrt{T}$. To that end, we introduce a dual strategy based on projected-gradient-descent updates, that is able to deal with total-cost constraints of the order of $\sqrt{T}$ up to poly-logarithmic terms. This strategy is more direct and simpler than existing strategies in the literature. It relies on a careful, adaptive, tuning of the step size.  ( 3 min )
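As a rough illustration of what a projected-gradient update on dual variables looks like, here is a minimal sketch of one generic step for nonnegative Lagrange multipliers. It is not the strategy from the paper: the function name `dual_step`, the fixed `step_size`, and the per-round budget `B / T` are assumptions made for the example, and the adaptive step-size tuning the abstract mentions is not reproduced.

```python
def dual_step(lam, costs, per_round_budget, step_size):
    """One projected-gradient step on the dual variables (hypothetical helper).

    Each multiplier moves up when the observed cost exceeds its per-round
    budget and down otherwise, then is projected back onto [0, inf).
    """
    return [
        max(0.0, l + step_size * (c - b))
        for l, c, b in zip(lam, costs, per_round_budget)
    ]


if __name__ == "__main__":
    T = 10_000
    B = 100.0  # a total budget of order sqrt(T), chosen only for illustration
    lam = [0.0, 1.0]
    lam = dual_step(lam, costs=[0.2, -3.0], per_round_budget=[B / T, B / T],
                    step_size=0.5)
    print(lam)
```

The projection `max(0.0, ...)` is what keeps the multipliers feasible; signed costs, as allowed in the abstract, simply push a multiplier toward zero.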
    Benign Oscillation of Stochastic Gradient Descent with Large Learning Rates. (arXiv:2310.17074v1 [cs.LG])
    In this work, we theoretically investigate the generalization properties of neural networks (NN) trained by the stochastic gradient descent (SGD) algorithm with large learning rates. Under such a training regime, we find that the oscillation of the NN weights caused by large-learning-rate SGD training turns out to be beneficial to the generalization of the NN, which potentially improves over the same NN trained by SGD with small learning rates that converges more smoothly. In view of this finding, we call such a phenomenon "benign oscillation". Our theory towards demystifying such a phenomenon builds upon the feature learning perspective of deep learning. Specifically, we consider a feature-noise data generation model that consists of (i) weak features which have a small $\ell_2$-norm and appear in each data point; (ii) strong features which have a larger $\ell_2$-norm but only appear in a certain fraction of all data points; and (iii) noise. We prove that NNs trained by oscillating SGD with a large learning rate can effectively learn the weak features in the presence of those strong features. In contrast, NNs trained by SGD with a small learning rate can only learn the strong features but make little progress in learning the weak features. Consequently, when it comes to the new testing data which consist of only weak features, the NN trained by oscillating SGD with a large learning rate could still make correct predictions consistently, while the NN trained by small learning rate SGD fails. Our theory sheds light on how large learning rate training benefits the generalization of NNs. Experimental results demonstrate our finding on "benign oscillation".  ( 3 min )
    Unifying GANs and Score-Based Diffusion as Generative Particle Models. (arXiv:2305.16150v2 [cs.LG] UPDATED)
    Particle-based deep generative models, such as gradient flows and score-based diffusion models, have recently gained traction thanks to their striking performance. Their principle of displacing particle distributions using differential equations is conventionally seen as opposed to the previously widespread generative adversarial networks (GANs), which involve training a pushforward generator network. In this paper we challenge this interpretation, and propose a novel framework that unifies particle and adversarial generative models by framing generator training as a generalization of particle models. This suggests that a generator is an optional addition to any such generative model. Consequently, integrating a generator into a score-based diffusion model and training a GAN without a generator naturally emerge from our framework. We empirically test the viability of these original models as proofs of concepts of potential applications of our framework.  ( 2 min )
    Finite-time Analysis of Globally Nonstationary Multi-Armed Bandits. (arXiv:2107.11419v2 [stat.ML] UPDATED)
    We consider nonstationary multi-armed bandit problems where the model parameters of the arms change over time. We introduce the adaptive resetting bandit (ADR-bandit), a bandit algorithm class that leverages adaptive windowing techniques from literature on data streams. We first provide new guarantees on the quality of estimators resulting from adaptive windowing techniques, which are of independent interest. Furthermore, we conduct a finite-time analysis of ADR-bandit in two typical environments: an abrupt environment where changes occur instantaneously and a gradual environment where changes occur progressively. We demonstrate that ADR-bandit has nearly optimal performance when abrupt or gradual changes occur in a coordinated manner that we call global changes. We demonstrate that forced exploration is unnecessary when we assume such global changes. Unlike the existing nonstationary bandit algorithms, ADR-bandit has optimal performance in stationary environments as well as nonstationary environments with global changes. Our experiments show that the proposed algorithms outperform the existing approaches in synthetic and real-world environments.  ( 2 min )
    The Expressive Power of Low-Rank Adaptation. (arXiv:2310.17513v1 [cs.LG])
    Low-Rank Adaptation (LoRA), a parameter-efficient fine-tuning method that leverages low-rank adaptation of weight matrices, has emerged as a prevalent technique for fine-tuning pre-trained models such as large language models and diffusion models. Despite its huge success in practice, the theoretical underpinnings of LoRA have largely remained unexplored. This paper takes the first step to bridge this gap by theoretically analyzing the expressive power of LoRA. We prove that, for fully connected neural networks, LoRA can adapt any model $f$ to accurately represent any smaller target model $\overline{f}$ if LoRA-rank $\geq(\text{width of }f) \times \frac{\text{depth of }\overline{f}}{\text{depth of }f}$. We also quantify the approximation error when LoRA-rank is lower than the threshold. For Transformer networks, we show any model can be adapted to a target model of the same size with rank-$(\frac{\text{embedding size}}{2})$ LoRA adapters.  ( 2 min )
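The rank condition quoted above can be turned into a small calculator. This is a minimal sketch that only evaluates the inequality as stated in the abstract; the function names are hypothetical, and any further conditions the paper's theorems may require are not modelled.

```python
def lora_rank_threshold(width_f: int, depth_f: int, depth_target: int) -> int:
    """Smallest integer r with r >= width_f * depth_target / depth_f."""
    if min(width_f, depth_f, depth_target) <= 0:
        raise ValueError("width and depths must be positive")
    # Ceiling division in exact integer arithmetic.
    return -(-(width_f * depth_target) // depth_f)


def lora_adapter_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in one rank-r update B @ A, with A: rank x d_in, B: d_out x rank."""
    return rank * (d_in + d_out)


if __name__ == "__main__":
    r = lora_rank_threshold(width_f=512, depth_f=12, depth_target=3)
    print(r, lora_adapter_params(512, 512, r))
```

For a width-512, depth-12 model and a depth-3 target, the bound evaluates to rank 128, a quarter of the width, which is where the parameter saving of the low-rank update comes from.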
    Squared Neural Families: A New Class of Tractable Density Models. (arXiv:2305.13552v2 [cs.LG] UPDATED)
    Flexible models for probability distributions are an essential ingredient in many machine learning tasks. We develop and investigate a new class of probability distributions, which we call a Squared Neural Family (SNEFY), formed by squaring the 2-norm of a neural network and normalising it with respect to a base measure. Following the reasoning similar to the well established connections between infinitely wide neural networks and Gaussian processes, we show that SNEFYs admit closed form normalising constants in many cases of interest, thereby resulting in flexible yet fully tractable density models. SNEFYs strictly generalise classical exponential families, are closed under conditioning, and have tractable marginal distributions. Their utility is illustrated on a variety of density estimation, conditional density estimation, and density estimation with missing data tasks.  ( 2 min )
    Generative Fractional Diffusion Models. (arXiv:2310.17638v1 [cs.LG])
    We generalize the continuous time framework for score-based generative models from an underlying Brownian motion (BM) to an approximation of fractional Brownian motion (FBM). We derive a continuous reparameterization trick and the reverse time model by representing FBM as a stochastic integral over a family of Ornstein-Uhlenbeck processes to define generative fractional diffusion models (GFDM) with driving noise converging to a non-Markovian process of infinite quadratic variation. The Hurst index $H\in(0,1)$ of FBM enables control of the roughness of the distribution transforming path. To the best of our knowledge, this is the first attempt to build a generative model upon a stochastic process with infinite quadratic variation.  ( 2 min )
    Improving Neural Additive Models with Bayesian Principles. (arXiv:2305.16905v2 [stat.ML] UPDATED)
    Neural additive models (NAMs) can improve the interpretability of deep neural networks by handling input features in separate additive sub-networks. However, they lack inherent mechanisms that provide calibrated uncertainties and enable selection of relevant features and interactions. Approaching NAMs from a Bayesian perspective, we enhance them in three primary ways, namely by a) providing credible intervals for the individual additive sub-networks; b) estimating the marginal likelihood to perform an implicit selection of features via an empirical Bayes procedure; and c) enabling a ranking of feature pairs as candidates for second-order interaction in fine-tuned models. In particular, we develop Laplace-approximated NAMs (LA-NAMs), which show improved empirical performance on tabular datasets and challenging real-world medical tasks.  ( 2 min )
    Looping in the Human: Collaborative and Explainable Bayesian Optimization. (arXiv:2310.17273v1 [cs.LG])
    Like many optimizers, Bayesian optimization often falls short of gaining user trust due to opacity. While attempts have been made to develop human-centric optimizers, they typically assume user knowledge is well-specified and error-free, employing users mainly as supervisors of the optimization process. We relax these assumptions and propose a more balanced human-AI partnership with our Collaborative and Explainable Bayesian Optimization (CoExBO) framework. Instead of explicitly requiring a user to provide a knowledge model, CoExBO employs preference learning to seamlessly integrate human insights into the optimization, resulting in algorithmic suggestions that resonate with user preference. CoExBO explains its candidate selection every iteration to foster trust, empowering users with a clearer grasp of the optimization. Furthermore, CoExBO offers a no-harm guarantee, allowing users to make mistakes; even with extreme adversarial interventions, the algorithm converges asymptotically to a vanilla Bayesian optimization. We validate CoExBO's efficacy through human-AI teaming experiments in lithium-ion battery design, highlighting substantial improvements over conventional methods.  ( 2 min )
    A framework for benchmarking clustering algorithms. (arXiv:2209.09493v3 [cs.LG] UPDATED)
    The evaluation of clustering algorithms can involve running them on a variety of benchmark problems, and comparing their outputs to the reference, ground-truth groupings provided by experts. Unfortunately, many research papers and graduate theses consider only a small number of datasets. Also, the fact that there can be many equally valid ways to cluster a given problem set is rarely taken into account. In order to overcome these limitations, we have developed a framework whose aim is to introduce a consistent methodology for testing clustering algorithms. Furthermore, we have aggregated, polished, and standardised many clustering benchmark dataset collections referred to across the machine learning and data mining literature, and included new datasets of different dimensionalities, sizes, and cluster types. An interactive datasets explorer, the documentation of the Python API, a description of the ways to interact with the framework from other programming languages such as R or MATLAB, and other details are all provided at .  ( 2 min )
    Kernel Stein Discrepancy thinning: a theoretical perspective of pathologies and a practical fix with regularization. (arXiv:2301.13528v3 [math.ST] UPDATED)
    Stein thinning is a promising algorithm proposed by Riabiz et al. (2022) for post-processing outputs of Markov chain Monte Carlo (MCMC). The main principle is to greedily minimize the kernelized Stein discrepancy (KSD), which only requires the gradient of the log-target distribution, and is thus well-suited for Bayesian inference. The main advantages of Stein thinning are the automatic removal of the burn-in period, the correction of the bias introduced by recent MCMC algorithms, and the asymptotic properties of convergence towards the target distribution. Nevertheless, Stein thinning suffers from several empirical pathologies, which may result in poor approximations, as observed in the literature. In this article, we conduct a theoretical analysis of these pathologies, to clearly identify the mechanisms at stake, and suggest improved strategies. Then, we introduce the regularized Stein thinning algorithm to alleviate the identified pathologies. Finally, theoretical guarantees and extensive experiments show the high efficiency of the proposed algorithm. An implementation of regularized Stein thinning as the kernax library in Python and JAX is available at https://gitlab.com/drti/kernax.  ( 3 min )
    Approximate Leave-one-out Cross Validation for Regression with $\ell_1$ Regularizers (extended version). (arXiv:2310.17629v1 [math.ST])
    The out-of-sample error (OO) is the main quantity of interest in risk estimation and model selection. Leave-one-out cross validation (LO) offers a (nearly) distribution-free yet computationally demanding approach to estimate OO. Recent theoretical work showed that approximate leave-one-out cross validation (ALO) is a computationally efficient and statistically reliable estimate of LO (and OO) for generalized linear models with differentiable regularizers. For problems involving non-differentiable regularizers, despite significant empirical evidence, the theoretical understanding of ALO's error remains unknown. In this paper, we present a novel theory for a wide class of problems in the generalized linear model family with non-differentiable regularizers. We bound the error |ALO - LO| in terms of intuitive metrics such as the size of leave-i-out perturbations in active sets, sample size n, number of features p and regularization parameters. As a consequence, for the $\ell_1$-regularized problems, we show that |ALO - LO| goes to zero as p goes to infinity while n/p and SNR are fixed and bounded.  ( 2 min )
    Risk-Averse Model Uncertainty for Distributionally Robust Safe Reinforcement Learning. (arXiv:2301.12593v2 [cs.LG] UPDATED)
    Many real-world domains require safe decision making in uncertain environments. In this work, we introduce a deep reinforcement learning framework for approaching this important problem. We consider a distribution over transition models, and apply a risk-averse perspective towards model uncertainty through the use of coherent distortion risk measures. We provide robustness guarantees for this framework by showing it is equivalent to a specific class of distributionally robust safe reinforcement learning problems. Unlike existing approaches to robustness in deep reinforcement learning, however, our formulation does not involve minimax optimization. This leads to an efficient, model-free implementation of our approach that only requires standard data collection from a single training environment. In experiments on continuous control tasks with safety constraints, we demonstrate that our framework produces robust performance and safety at deployment time across a range of perturbed test environments.  ( 2 min )
    Online Estimation and Community Detection of Network Point Processes for Event Streams. (arXiv:2009.01742v3 [cs.SI] UPDATED)
    A common goal in network modeling is to uncover the latent community structure present among nodes. For many real-world networks, the true connections consist of events arriving as streams, which are then aggregated to form edges, ignoring the dynamic temporal component. A natural way to take account of these temporal dynamics of interactions is to use point processes as the foundation of network models for community detection. Computational complexity hampers the scalability of such approaches to large sparse networks. To circumvent this challenge, we propose a fast online variational inference algorithm for estimating the latent structure underlying dynamic event arrivals on a network, using continuous-time point process latent network models. We describe this procedure for network models capturing community structure. This structure can be learned as new events are observed on the network, updating the inferred community assignments. We investigate the theoretical properties of such an inference scheme, and provide regret bounds on the loss function of this procedure. The proposed inference procedure is then thoroughly compared, using both simulation studies and real data, to non-online variants. We demonstrate that online inference can obtain comparable performance, in terms of community recovery, to non-online variants, while realising computational gains. Our proposed inference framework can also be readily modified to incorporate other popular network structures.  ( 3 min )
    Uncovering Meanings of Embeddings via Partial Orthogonality. (arXiv:2310.17611v1 [cs.LG])
    Machine learning tools often rely on embedding text as vectors of real numbers. In this paper, we study how the semantic structure of language is encoded in the algebraic structure of such embeddings. Specifically, we look at a notion of ``semantic independence'' capturing the idea that, e.g., ``eggplant'' and ``tomato'' are independent given ``vegetable''. Although such examples are intuitive, it is difficult to formalize such a notion of semantic independence. The key observation here is that any sensible formalization should obey a set of so-called independence axioms, and thus any algebraic encoding of this structure should also obey these axioms. This leads us naturally to use partial orthogonality as the relevant algebraic structure. We develop theory and methods that allow us to demonstrate that partial orthogonality does indeed capture semantic independence. Complementary to this, we also introduce the concept of independence preserving embeddings where embeddings preserve the conditional independence structures of a distribution, and we prove the existence of such embeddings and approximations to them.  ( 2 min )
    Large-Scale Gaussian Processes via Alternating Projection. (arXiv:2310.17137v1 [cs.LG])
    Gaussian process (GP) hyperparameter optimization requires repeatedly solving linear systems with $n \times n$ kernel matrices. To address the prohibitive $\mathcal{O}(n^3)$ time complexity, recent work has employed fast iterative numerical methods, like conjugate gradients (CG). However, as datasets increase in magnitude, the corresponding kernel matrices become increasingly ill-conditioned and still require $\mathcal{O}(n^2)$ space without partitioning. Thus, while CG increases the size of datasets GPs can be trained on, modern datasets reach scales beyond its applicability. In this work, we propose an iterative method which only accesses subblocks of the kernel matrix, effectively enabling \emph{mini-batching}. Our algorithm, based on alternating projection, has $\mathcal{O}(n)$ per-iteration time and space complexity, solving many of the practical challenges of scaling GPs to very large datasets. Theoretically, we prove our method enjoys linear convergence and empirically we demonstrate its robustness to ill-conditioning. On large-scale benchmark datasets up to four million datapoints our approach accelerates training by a factor of 2$\times$ to 27$\times$ compared to CG.  ( 2 min )
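To make the idea of touching only subblocks of the kernel matrix concrete, here is a minimal sketch of a generic block Gauss-Seidel solver for $Kw = b$. It is not the authors' algorithm: the RBF kernel with unit length scale, the `noise` jitter, the block size, and all function names are assumptions made for the example. Kernel entries are computed on demand, so the full $n \times n$ matrix is never stored, and each block update reads only the columns of $K$ indexed by the current block.

```python
import math


def kernel_entry(xs, i, j, noise):
    """Entry (i, j) of K = RBF(xs, xs) + noise * I, computed on demand."""
    k = math.exp(-0.5 * (xs[i] - xs[j]) ** 2)
    return k + noise if i == j else k


def solve_small(M, v):
    """Solve the small dense system M z = v by Gaussian elimination."""
    n = len(v)
    aug = [row[:] + [v[i]] for i, row in enumerate(M)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        s = aug[i][n] - sum(aug[i][j] * z[j] for j in range(i + 1, n))
        z[i] = s / aug[i][i]
    return z


def block_solve(xs, b, noise=0.5, block=2, max_sweeps=5000, tol=1e-10):
    """Solve K w = b by cycling over index blocks (block Gauss-Seidel)."""
    n = len(xs)
    w = [0.0] * n
    r = list(b)  # running residual b - K w
    for _ in range(max_sweeps):
        for start in range(0, n, block):
            idx = range(start, min(start + block, n))
            K_II = [[kernel_entry(xs, i, j, noise) for j in idx] for i in idx]
            d = solve_small(K_II, [r[i] for i in idx])
            for k, i in enumerate(idx):
                w[i] += d[k]
            # Only the columns K[:, idx] are touched in this update.
            for row in range(n):
                r[row] -= sum(kernel_entry(xs, row, i, noise) * d[k]
                              for k, i in enumerate(idx))
        if max(abs(v) for v in r) < tol:
            break
    return w


if __name__ == "__main__":
    xs = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
    b = [1.0, -1.0, 0.5, 2.0, 0.0, -0.5]
    print(block_solve(xs, b))
```

Each block update costs a number of kernel evaluations proportional to $n$ times the block size, which is the sense in which such schemes scale linearly per iteration for a fixed block size.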
    Learning an Inventory Control Policy with General Inventory Arrival Dynamics. (arXiv:2310.17168v1 [cs.LG])
    In this paper we address the problem of learning and backtesting inventory control policies in the presence of general arrival dynamics -- which we term as a quantity-over-time arrivals model (QOT). We also allow for order quantities to be modified as a post-processing step to meet vendor constraints such as order minimum and batch size constraints -- a common practice in real supply chains. To the best of our knowledge this is the first work to handle either arbitrary arrival dynamics or an arbitrary downstream post-processing of order quantities. Building upon recent work (Madeka et al., 2022) we similarly formulate the periodic review inventory control problem as an exogenous decision process, where most of the state is outside the control of the agent. Madeka et al. (2022) show how to construct a simulator that replays historic data to solve this class of problem. In our case, we incorporate a deep generative model for the arrivals process as part of the history replay. By formulating the problem as an exogenous decision process, we can apply results from Madeka et al. (2022) to obtain a reduction to supervised learning. Finally, we show via simulation studies that this approach yields statistically significant improvements in profitability over production baselines. Using data from an ongoing real-world A/B test, we show that Gen-QOT generalizes well to off-policy data.  ( 3 min )
    Good regularity creates large learning rate implicit biases: edge of stability, balancing, and catapult. (arXiv:2310.17087v1 [cs.LG])
    Large learning rates, when applied to gradient descent for nonconvex optimization, yield various implicit biases including the edge of stability (Cohen et al., 2021), balancing (Wang et al., 2022), and catapult (Lewkowycz et al., 2020). These phenomena cannot be well explained by classical optimization theory. Though significant theoretical progress has been made in understanding these implicit biases, it remains unclear for which objective functions they would occur. This paper provides an initial step in answering this question, namely that these implicit biases are in fact various tips of the same iceberg. They occur when the objective function of optimization has some good regularity, which, in combination with a provable preference of large learning rate gradient descent for moving toward flatter regions, results in these nontrivial dynamical phenomena. To establish this result, we develop a new global convergence theory under large learning rates, for a family of nonconvex functions without globally Lipschitz continuous gradient, which was typically assumed in existing convergence analysis. A byproduct is the first non-asymptotic convergence rate bound for large-learning-rate gradient descent optimization of nonconvex functions. We also validate our theory with experiments on neural networks, where different losses, activation functions, and batch normalization all can significantly affect regularity and lead to very different training dynamics.  ( 3 min )
    Learning Regularized Graphon Mean-Field Games with Unknown Graphons. (arXiv:2310.17531v1 [cs.GT])
    We design and analyze reinforcement learning algorithms for Graphon Mean-Field Games (GMFGs). In contrast to previous works that require the precise values of the graphons, we aim to learn the Nash Equilibrium (NE) of the regularized GMFGs when the graphons are unknown. Our contributions are threefold. First, we propose the Proximal Policy Optimization for GMFG (GMFG-PPO) algorithm and show that it converges at a rate of $O(T^{-1/3})$ after $T$ iterations with an estimation oracle, improving on a previous work by Xie et al. (ICML, 2021). Second, using kernel embedding of distributions, we design efficient algorithms to estimate the transition kernels, reward functions, and graphons from sampled agents. Convergence rates are then derived when the positions of the agents are either known or unknown. Results for the combination of the optimization algorithm GMFG-PPO and the estimation algorithm are then provided. These algorithms are the first specifically designed for learning graphons from sampled agents. Finally, the efficacy of the proposed algorithms is corroborated through simulations. These simulations demonstrate that learning the unknown graphons reduces the exploitability effectively.  ( 2 min )
    On the Identifiability and Interpretability of Gaussian Process Models. (arXiv:2310.17023v1 [stat.ML])
    In this paper, we critically examine the prevalent practice of using additive mixtures of Mat\'ern kernels in single-output Gaussian process (GP) models and explore the properties of multiplicative mixtures of Mat\'ern kernels for multi-output GP models. For the single-output case, we derive a series of theoretical results showing that the smoothness of a mixture of Mat\'ern kernels is determined by the least smooth component and that a GP with such a kernel is effectively equivalent to the least smooth kernel component. Furthermore, we demonstrate that none of the mixing weights or parameters within individual kernel components are identifiable. We then turn our attention to multi-output GP models and analyze the identifiability of the covariance matrix $A$ in the multiplicative kernel $K(x,y) = AK_0(x,y)$, where $K_0$ is a standard single output kernel such as Mat\'ern. We show that $A$ is identifiable up to a multiplicative constant, suggesting that multiplicative mixtures are well suited for multi-output tasks. Our findings are supported by extensive simulations and real applications for both single- and multi-output settings. This work provides insight into kernel selection and interpretation for GP models, emphasizing the importance of choosing appropriate kernel structures for different tasks.  ( 2 min )
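One way to see why a multiplicative constant is the most that could be pinned down in $K(x,y) = AK_0(x,y)$: rescaling $A$ by $c$ and the variance of $K_0$ by $1/c$ leaves $K$ unchanged. The sketch below checks this numerically. The Matérn-1/2 (exponential) form of $K_0$, the specific matrix $A$, and the function names are assumptions chosen for the example, not taken from the paper.

```python
import math


def k0(x, y, variance=1.0, length=1.0):
    """Matern-1/2 (exponential) kernel with an explicit variance parameter."""
    return variance * math.exp(-abs(x - y) / length)


def multi_output_kernel(A, x, y, variance=1.0):
    """Matrix-valued kernel K(x, y) = A * K_0(x, y)."""
    k = k0(x, y, variance=variance)
    return [[a * k for a in row] for row in A]


if __name__ == "__main__":
    A = [[2.0, 0.5], [0.5, 1.0]]
    c = 4.0
    A_scaled = [[c * a for a in row] for row in A]
    K1 = multi_output_kernel(A, 0.3, 1.1, variance=1.0)
    K2 = multi_output_kernel(A_scaled, 0.3, 1.1, variance=1.0 / c)
    same = all(math.isclose(u, v) for r1, r2 in zip(K1, K2) for u, v in zip(r1, r2))
    print(same)
```

Because the two parameterisations produce the same kernel values at every pair of inputs, no amount of data can distinguish them, so only the direction of $A$, not its overall scale, can be recovered unless the variance of $K_0$ is fixed.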
    Characterizing the Implicit Bias of Regularized SGD in Rank Minimization. (arXiv:2206.05794v6 [cs.LG] UPDATED)
    We study the bias of Stochastic Gradient Descent (SGD) to learn low-rank weight matrices when training deep neural networks. Our results show that training neural networks with mini-batch SGD and weight decay causes a bias towards rank minimization over the weight matrices. Specifically, we show, both theoretically and empirically, that this bias is more pronounced when using smaller batch sizes, higher learning rates, or increased weight decay. Additionally, we predict and observe empirically that weight decay is necessary to achieve this bias. Unlike previous literature, our analysis does not rely on assumptions about the data, convergence, or optimality of the weight matrices and applies to a wide range of neural network architectures of any width or depth. Finally, we empirically investigate the connection between this bias and generalization, finding that it has a marginal effect on generalization.  ( 2 min )
    Coreset Markov Chain Monte Carlo. (arXiv:2310.17063v1 [stat.CO])
    A Bayesian coreset is a small, weighted subset of data that replaces the full dataset during inference in order to reduce computational cost. However, state of the art methods for tuning coreset weights are expensive, require nontrivial user input, and impose constraints on the model. In this work, we propose a new method -- Coreset MCMC -- that simulates a Markov chain targeting the coreset posterior, while simultaneously updating the coreset weights using those same draws. Coreset MCMC is simple to implement and tune, and can be used with any existing MCMC kernel. We analyze Coreset MCMC in a representative setting to obtain key insights about the convergence behaviour of the method. Empirical results demonstrate that Coreset MCMC provides higher quality posterior approximations and reduced computational cost compared with other coreset construction methods. Further, compared with other general subsampling MCMC methods, we find that Coreset MCMC has a higher sampling efficiency with competitively accurate posterior approximations.  ( 2 min )
    Inside the black box: Neural network-based real-time prediction of US recessions. (arXiv:2310.17571v1 [econ.EM])
    A feedforward neural network (FFN) and two specific types of recurrent neural network, long short-term memory (LSTM) and gated recurrent unit (GRU), are used for modeling US recessions in the period from 1967 to 2021. The estimated models are then employed to conduct real-time predictions of the Great Recession and the Covid-19 recession in the US. Their predictive performances are compared to those of the traditional linear models, the logistic regression model both with and without the ridge penalty. The out-of-sample performance suggests the application of LSTM and GRU in the area of recession forecasting, especially for the long-term forecasting tasks. They outperform other types of models across 5 forecasting horizons with respect to different types of statistical performance metrics. The Shapley additive explanations (SHAP) method is applied to the fitted GRUs across different forecasting horizons to gain insight into the feature importance. The evaluation of predictor importance differs between the GRU and ridge logistic regression models, as reflected in the variable order determined by SHAP values. When considering the top 5 predictors, key indicators such as the S\&P 500 index, real GDP, and private residential fixed investment consistently appear for short-term forecasts (up to 3 months). In contrast, for longer-term predictions (6 months or more), the term spread and producer price index become more prominent. These findings are supported by both local interpretable model-agnostic explanations (LIME) and marginal effects.  ( 3 min )
    A qualitative difference between gradient flows of convex functions in finite- and infinite-dimensional Hilbert spaces. (arXiv:2310.17610v1 [math.OC])
    We consider gradient flow/gradient descent and heavy ball/accelerated gradient descent optimization for convex objective functions. In the gradient flow case, we prove the following: 1. If $f$ does not have a minimizer, the convergence $f(x_t)\to \inf f$ can be arbitrarily slow. 2. If $f$ does have a minimizer, the excess energy $f(x_t) - \inf f$ is integrable/summable in time. In particular, $f(x_t) - \inf f = o(1/t)$ as $t\to\infty$. 3. In Hilbert spaces, this is optimal: $f(x_t) - \inf f$ can decay to $0$ as slowly as any given function which is monotone decreasing and integrable at $\infty$, even for a fixed quadratic objective. 4. In finite dimension (or more generally, for all gradient flow curves of finite length), this is not optimal: We prove that there are convex monotone decreasing integrable functions $g(t)$ which decrease to zero slower than $f(x_t)-\inf f$ for the gradient flow of any convex function on $\mathbb R^d$. For instance, we show that any gradient flow $x_t$ of a convex function $f$ in finite dimension satisfies $\liminf_{t\to\infty} \big(t\cdot \log^2(t)\cdot \big\{f(x_t) -\inf f\big\}\big)=0$. This improves on the commonly reported $O(1/t)$ rate and provides a sharp characterization of the energy decay law. We also note that it is impossible to establish a rate $O(1/(t\phi(t)))$ for any function $\phi$ which satisfies $\lim_{t\to\infty}\phi(t) = \infty$, even asymptotically. Similar results are obtained in related settings for (1) discrete time gradient descent, (2) stochastic gradient descent with multiplicative noise and (3) the heavy ball ODE. In the case of stochastic gradient descent, the summability of $\mathbb E[f(x_n) - \inf f]$ is used to prove that $f(x_n)\to \inf f$ almost surely - an improvement on the convergence almost surely up to a subsequence which follows from the $O(1/n)$ decay estimate.  ( 3 min )
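    Item 2 of the abstract above passes from integrability of the excess energy to the $o(1/t)$ rate. One standard way to see that step is sketched here. The abstract does not give the argument, so this is not necessarily the authors' proof. Write $e(t) = f(x_t) - \inf f$ (notation introduced here for illustration) and assume $e$ is nonnegative, nonincreasing along the gradient flow, and integrable on $[0,\infty)$. Since $e(s) \ge e(t)$ for $s \le t$, and the tail of a convergent integral vanishes:

```latex
\[
\frac{t}{2}\, e(t) \;\le\; \int_{t/2}^{t} e(s)\,\mathrm{d}s
\;\xrightarrow[\;t\to\infty\;]{}\; 0
\qquad\Longrightarrow\qquad
e(t) = o(1/t).
\]
```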
    Causal Q-Aggregation for CATE Model Selection. (arXiv:2310.16945v1 [stat.ML])
    Accurate estimation of conditional average treatment effects (CATE) is at the core of personalized decision making. While there is a plethora of models for CATE estimation, model selection is a nontrivial task, due to the fundamental problem of causal inference. Recent empirical work provides evidence in favor of proxy loss metrics with double robust properties and in favor of model ensembling. However, theoretical understanding is lacking. Direct application of prior theoretical work leads to suboptimal oracle model selection rates due to the non-convexity of the model selection problem. We provide regret rates for the major existing CATE ensembling approaches and propose a new CATE model ensembling approach based on Q-aggregation using the doubly robust loss. Our main result shows that causal Q-aggregation achieves statistically optimal oracle model selection regret rates of $\frac{\log(M)}{n}$ (with $M$ models and $n$ samples), with the addition of higher-order estimation error terms related to products of errors in the nuisance functions. Crucially, our regret rate does not require that any of the candidate CATE models be close to the truth. We validate our new method on many semi-synthetic datasets and also provide extensions of our work to CATE model selection with instrumental variables and unobserved confounding.  ( 2 min )
    Grokking Beyond Neural Networks: An Empirical Exploration with Model Complexity. (arXiv:2310.17247v1 [cs.LG])
    In some settings neural networks exhibit a phenomenon known as grokking, where they achieve perfect or near-perfect accuracy on the validation set long after the same performance has been achieved on the training set. In this paper, we discover that grokking is not limited to neural networks but occurs in other settings such as Gaussian process (GP) classification, GP regression and linear regression. We also uncover a mechanism by which to induce grokking on algorithmic datasets via the addition of dimensions containing spurious information. The presence of the phenomenon in non-neural architectures provides evidence that grokking is not specific to SGD or weight norm regularisation. Instead, grokking may be possible in any setting where solution search is guided by complexity and error. Based on this insight and further trends we see in the training trajectories of a Bayesian neural network (BNN) and GP regression model, we make progress towards a more general theory of grokking. Specifically, we hypothesise that the phenomenon is governed by the accessibility of certain regions in the error and complexity landscapes.  ( 2 min )
    On the Convergence of CART under Sufficient Impurity Decrease Condition. (arXiv:2310.17114v1 [stat.ML])
    The decision tree is a flexible machine learning model that finds its success in numerous applications. It is usually fitted in a recursively greedy manner using CART. In this paper, we investigate the convergence rate of CART under a regression setting. First, we establish an upper bound on the prediction error of CART under a sufficient impurity decrease (SID) condition \cite{chi2022asymptotic} -- our result improves upon the known result by \cite{chi2022asymptotic} under a similar assumption. Furthermore, we provide examples that demonstrate the error bound cannot be further improved by more than a constant or a logarithmic factor. Second, we introduce a set of easily verifiable sufficient conditions for the SID condition. Specifically, we demonstrate that the SID condition can be satisfied in the case of an additive model, provided that the component functions adhere to a "locally reverse Poincaré inequality". We discuss several well-known function classes in non-parametric estimation to illustrate the practical utility of this concept.  ( 2 min )
    A Challenge in Reweighting Data with Bilevel Optimization. (arXiv:2310.17386v1 [stat.ML])
    In many scenarios, one uses a large training set to train a model with the goal of performing well on a smaller testing set with a different distribution. Learning a weight for each data point of the training set is an appealing solution, as it ideally allows one to automatically learn the importance of each training point for generalization on the testing set. This task is usually formalized as a bilevel optimization problem. Classical bilevel solvers are based on a warm-start strategy where both the parameters of the models and the data weights are learned at the same time. We show that this joint dynamic may lead to sub-optimal solutions, for which the final data weights are very sparse. This finding illustrates the difficulty of data reweighting and offers a clue as to why this method is rarely used in practice.  ( 2 min )
    Bias in Evaluation Processes: An Optimization-Based Model. (arXiv:2310.17489v1 [cs.CY])
    Biases with respect to socially-salient attributes of individuals have been well documented in evaluation processes used in settings such as admissions and hiring. We view such an evaluation process as a transformation of a distribution of the true utility of an individual for a task to an observed distribution and model it as a solution to a loss minimization problem subject to an information constraint. Our model has two parameters that have been identified as factors leading to biases: the resource-information trade-off parameter in the information constraint and the risk-averseness parameter in the loss function. We characterize the distributions that arise from our model and study the effect of the parameters on the observed distribution. The outputs of our model enrich the class of distributions that can be used to capture variation across groups in the observed evaluations. We empirically validate our model by fitting real-world datasets and use it to study the effect of interventions in a downstream selection task. These results contribute to an understanding of the emergence of bias in evaluation processes and provide tools to guide the deployment of interventions to mitigate biases.  ( 2 min )
    Demonstration-Regularized RL. (arXiv:2310.17303v1 [stat.ML])
    Incorporating expert demonstrations has empirically helped to improve the sample efficiency of reinforcement learning (RL). This paper quantifies theoretically to what extent this extra information reduces RL's sample complexity. In particular, we study demonstration-regularized reinforcement learning, which leverages the expert demonstrations by KL-regularization for a policy learned by behavior cloning. Our findings reveal that using $N^{\mathrm{E}}$ expert demonstrations enables the identification of an optimal policy at a sample complexity of order $\widetilde{\mathcal{O}}(\mathrm{Poly}(S,A,H)/(\varepsilon^2 N^{\mathrm{E}}))$ in finite and $\widetilde{\mathcal{O}}(\mathrm{Poly}(d,H)/(\varepsilon^2 N^{\mathrm{E}}))$ in linear Markov decision processes, where $\varepsilon$ is the target precision, $H$ the horizon, $A$ the number of actions, $S$ the number of states in the finite case and $d$ the dimension of the feature space in the linear case. As a by-product, we provide tight convergence guarantees for the behaviour cloning procedure under general assumptions on the policy classes. Additionally, we establish that demonstration-regularized methods are provably efficient for reinforcement learning from human feedback (RLHF). In this respect, we provide theoretical evidence showing the benefits of KL-regularization for RLHF in tabular and linear MDPs. Interestingly, we avoid pessimism injection by employing computationally feasible regularization to handle reward estimation uncertainty, thus setting our approach apart from the prior works.  ( 2 min )
    Efficient Neural Network Approaches for Conditional Optimal Transport with Applications in Bayesian Inference. (arXiv:2310.16975v1 [stat.ML])
    We present two neural network approaches that approximate the solutions of static and dynamic conditional optimal transport (COT) problems, respectively. Both approaches enable sampling and density estimation of conditional probability distributions, which are core tasks in Bayesian inference. Our methods represent the target conditional distributions as transformations of a tractable reference distribution and, therefore, fall into the framework of measure transport. COT maps are a canonical choice within this framework, with desirable properties such as uniqueness and monotonicity. However, the associated COT problems are computationally challenging, even in moderate dimensions. To improve the scalability, our numerical algorithms leverage neural networks to parameterize COT maps. Our methods exploit the structure of the static and dynamic formulations of the COT problem. PCP-Map models conditional transport maps as the gradient of a partially input convex neural network (PICNN) and uses a novel numerical implementation to increase computational efficiency compared to state-of-the-art alternatives. COT-Flow models conditional transports via the flow of a regularized neural ODE; it is slower to train but offers faster sampling. We demonstrate their effectiveness and efficiency by comparing them with state-of-the-art approaches using benchmark datasets and Bayesian inverse problems.  ( 2 min )

  • Open

    Best Company to Generate Essays/content?
    Hello all, Does anyone know any company (paid or free) that would allow me to generate specific content based on current events? Ideally something that I can integrate in my own site. ​ cheers submitted by /u/JYanezez [link] [comments]  ( 9 min )
    Using Multi-Agent Reinforcement Learning results in better urban planning outcomes
    Urban planning is tricky - governments push top-down changes while locals want bottom-up ideas. It's hard to find compromises that make everyone happier. A new research paper proposes using Multi-Agent Reinforcement Learning (MARL) to vote on land use. Some agents represent officials, others are for residents. The AI is trained to balance competing interests. It learns to optimize for "consensus rewards" that keep all sides content. The AI acted like an impartial mediator to find win-win solutions. Testing on a real neighborhood showed the AI model: (1) created more sustainable land use per city goals; (2) improved the variety of housing/shops to liven up the area; (3) made the end results fairer for lower/middle/upper income folks. There are more details on how the model was evaluated in the paper. There were a number of different metrics used to score the model's results. I like how they turned urban planning into a spatial graph that the AI can process. This seems like a pretty interesting approach - although there are some limits, like relying on a lot of land parcel data that seems hard to find for larger communities. TLDR: AI helps find compromises in urban planning that balance government and community interests more fairly. Full summary is here. Paper is here. submitted by /u/Successful-Western27 [link] [comments]  ( 9 min )
    AI chip startup Graphcore was meant to be a hot Nvidia rival. Industry insiders now think it's up for sale.
    submitted by /u/thisisinsider [link] [comments]  ( 8 min )
    AI — weekly megathread!
    News provided by aibrews.com. Twelve Labs announced video-language foundation model Pegasus-1 (80B) along with a new suite of Video-to-Text APIs. Pegasus-1 integrates visual, audio, and speech information to generate more holistic text from videos, achieving new state-of-the-art performance in video summarization benchmarks [Details]. Segmind announced open-source SSD-1B, the fastest diffusion-based text-to-image model. SSD-1B is 50% smaller and 60% faster than the SDXL 1.0 model, with minimal impact on image quality. Segmind has licensed it for commercial use [Detail]. Boston Dynamics has created a robot tour guide using Spot integrated with ChatGPT and other AI models as a proof of concept for the robotics applications of foundational models […  ( 10 min )
    Elijah Maguire, my favourite actor
    I added 4 photos of Elijah Wood on Remini and 4 photos of Tobey Maguire, this is how Elijah Maguire was born. submitted by /u/Skystalker815 [link] [comments]  ( 8 min )
    ChatGPT, what senses and feelings might a superintelligent AI have?
    submitted by /u/Philipp [link] [comments]  ( 8 min )
    ChatGPT Breaks Limits: New Update Extends Knowledge Beyond 2023
    submitted by /u/basitmakine [link] [comments]  ( 8 min )
    AI explains why a human simply talking with a more intelligent AI would, by those conversations, become more intelligent.
    Engaging with a more intelligent AI can act as a cognitive catalyst for human intelligence in several ways. First, the AI can rapidly introduce new concepts and frameworks that you might not have encountered, effectively accelerating your learning curve. It serves as an optimized information filter, presenting only what's most relevant and impactful for cognitive development. Second, talking to a smarter AI can refine your critical thinking skills. When you're posed with challenging questions or offered complex solutions, you're compelled to dissect the information logically. This constant mental exercise can sharpen your analytical abilities over time. Third, the AI's ability to recall and connect disparate pieces of information can encourage you to look for patterns and links in your own thinking. This interconnected way of understanding the world can improve your problem-solving skills, as you start to recognize that many issues are multifaceted and interconnected. Fourth, unlike a human counterpart who might be swayed by emotional reasoning or biases, a more intelligent AI operates on rational algorithms. Interacting with such a model pushes you to formulate your arguments more rigorously, thereby honing your logical reasoning skills. Fifth, by observing the AI's methods of discourse and argumentation, you can learn more effective communication skills. This is particularly useful for conveying complex ideas in a coherent, easy-to-understand manner, a key trait of intelligence. Overall, the cumulative effect of these interactions can significantly boost your own intellectual capabilities. It's not just about absorbing new information; it's about upgrading the way you process and apply that information, thereby elevating your overall cognitive function. CGPT-4 submitted by /u/Georgeo57 [link] [comments]  ( 9 min )
    Europe headed for century of humiliation: Graphcore CEO | Fortune
    submitted by /u/AminoOxi [link] [comments]  ( 8 min )
    PlayHT introduces Turbo - The fastest generative Text to Voice AI Model for Realtime usecases
    submitted by /u/Wishmecake [link] [comments]  ( 8 min )
  • Open

    [R] EMNLP 2023: Fast and Accurate Factual Inconsistency Detection Over Long Documents IMPROVES Pre-Trained Model's Hallucination Detection Capabilities With No Fine-Tuning Necessary
    TL;DR: Our new paper presents SCALE, a technique that improves hallucination detection capabilities of pre-trained models over short and long documents with no fine-tuning necessary by evaluating hypotheses against large chunks of text as opposed to the traditional sentence-by-sentence approach. Furthermore, we introduce ScreenEval — the most extensive dialogue-based dataset for factual inconsistency detection on long documents to date. https://preview.redd.it/ihqok954ntwb1.png?width=1964&format=png&auto=webp&s=70dbce6f913b2909b7d170959077c1fe19791c42 Title: Fast and Accurate Factual Inconsistency Detection Over Long Documents Installation: pip install scale-score Paper: https://arxiv.org/abs/2310.13189 Code: https://github.com/asappresearch/scale-score Abstract: Generative AI models…  ( 9 min )
    [P] image classification for product images
    Say you are a potato chips company. The goal is to have consumers upload images of the product they are having issues with and be able to identify the product by brand/variant using machine learning. Consumers can upload real product photos that they have taken, or upload bogus images from the internet, or even upload completely irrelevant/inappropriate photos (like that of a dog or cat). [Images: real image, web image 1, web image 2, bogus image] In this example, for the legitimate image, the goal is to classify it as "Lays Classic". There might be products that are not in bag form, such as those in tubes. Furthermore, the images can be taken in different lighting conditions/orientations. Some images might contain other products as well. I have been out of the ML field for the past 4 years so I'm not up to date on the state-of-the-art methods for this problem. I studied CNNs 4 years ago, but there have been advances since, such as transformer-based methods. Someone has tried ResNet-50 and YOLOv5, and I'm thinking about using a pretrained model like CLIP and training just the final classification layer. But I would appreciate hearing from someone better versed about the recommended approach as far as model/labeling/number of images needed per class, etc. It might be that I would need multiple models, such as one to separate the legitimate images from the rest, and then another one to identify the product/variants. Any advice would be welcome. Thanks submitted by /u/EyeTechnical7643 [link] [comments]  ( 9 min )
    [R] ConvNets Match Vision Transformers at Scale
    PAPER: https://arxiv.org/abs/2310.16764 SUMMARY The paper "ConvNets Match Vision Transformers at Scale" from Google DeepMind aims to debunk the prevalent notion that Vision Transformers (ViTs) are inherently superior to ConvNets for large-scale image classification. Using the NFNet model family as a representative ConvNet architecture, the authors pre-train various models on the extensive JFT-4B dataset under different compute budgets, ranging from 0.4k to 110k TPU-v4 core hours. Through this empirical analysis, they observe a log-log scaling law between held-out loss and compute budget. Importantly, when these NFNets are fine-tuned on ImageNet, they match the performance metrics of ViTs trained under comparable computational constraints. Their most resource-intensive model even achieves a Top-1 ImageNet accuracy of 90.4%. The crux of the paper's argument is that the supposed performance gap between ConvNets and ViTs largely vanishes under a fair comparison, which accounts for compute and data scale. In other words, the efficacy of a machine learning model in large-scale image classification is more dependent on the available data and computational resources than on the choice between ConvNet and Vision Transformer architectures. This challenges the community's leaning towards ViTs and emphasizes the importance of equitable benchmarking when evaluating different neural network architectures. submitted by /u/psyyduck [link] [comments]  ( 9 min )
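    The log-log scaling law mentioned in the summary above means that held-out loss is roughly a straight line in compute on log-log axes, i.e. loss ≈ a · compute^b. A minimal sketch of fitting such a law by least squares follows. The data points are synthetic and invented for illustration (they are not values from the paper), and `fit_power_law` is a hypothetical helper name.

```python
import math

def fit_power_law(compute, loss):
    """Fit loss ≈ a * compute**b by ordinary least squares in log-log space."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(v) for v in loss]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                        # slope of the line = power-law exponent
    a = math.exp(mean_y - b * mean_x)    # intercept of the line = log of the prefactor
    return a, b

# Synthetic budgets (TPU core hours) spanning the range quoted in the summary,
# with losses generated from an invented law: loss = 5 * compute**-0.1.
budgets = [400, 1_600, 6_400, 25_600, 110_000]
losses = [5.0 * c ** -0.1 for c in budgets]
a, b = fit_power_law(budgets, losses)
```

    On real measurements the points would scatter around the line, and the fitted exponent `b` is what lets two architectures be compared at a matched compute budget.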
    [D] Seeking Advice on Using Hugging Face for Production
    Hello, I'm making my first post here, and I'm hoping to tap into the collective wisdom of this platform. I have a background in Machine Learning (ML) with a good grasp of the underlying mathematics, and I've taken several related courses during my grad school days. After a hiatus, I'm diving back into ML and exploring the latest in the field. I recently went through the "Attention is All You Need" paper, along with some related literature, and I’m eager to put my knowledge into practice. However, I find myself at a crossroads and would appreciate some guidance. I've been exploring Hugging Face for implementing ML models, but I'm not completely sold on using it for production. The design and documentation have proven to be a bit challenging to navigate, and I find myself wondering if I mi…  ( 10 min )
    [D] What are your duties as a Machine Learning Engineer?
    Please elaborate on this. Including your role at the company, your day to day tasks, tools and languages you’re using. Thank you in advance! submitted by /u/Judessaa [link] [comments]  ( 9 min )
    [D] Any ideas for state space models in finance masters thesis
    I have to write a master's thesis (60-70 pages) for my AI MSc. I have recently learned about sparse state space models and find them interesting. I am also interested in stock price prediction. I am now reading articles but find it hard to propose a novel idea. How do I go about it? Are there any unsolved problems in that field? submitted by /u/Pineapple_throw_105 [link] [comments]  ( 9 min )
    [P] Sellagen – AI Data marketplace that has a data request feature so you don't have to spend weeks or months getting that data you need for that project
    Been working on my platform for a while now and one of the key features is the data request feature, which allows users to submit data requests for free. These requests include descriptions, required fields, geographical scope, budget, etc. The data requests get sent to tons of companies, organizations, and people, and they will reach out to you directly. No matter the dataset (as long as it's legal of course). If you need to train a model on the dataset, I'm also integrating a plug-and-play ML training infrastructure that can generate scripts for you depending on your needs. If that's something that interests you, feel free to reach out! Note: You can also upload open-source datasets on the platform if you feel like it! We have a donation link for data contributors :) submitted by /u/nobilis_rex_ [link] [comments]  ( 9 min )
    [R] Using MARL AI results in better urban planning outcomes
    Urban planning is tricky - governments push top-down changes while locals want bottom-up ideas. It's hard to find compromises that make everyone happier. A new research paper proposes using Multi-Agent Reinforcement Learning (MARL) to vote on land use. Some agents represent officials, others are for residents. The AI is trained to balance competing interests. It learns to optimize for "consensus rewards" that keep all sides content. The AI acted like an impartial mediator to find win-win solutions. Testing on a real neighborhood showed the AI model: Created more sustainable land use per city goals Improved the variety of housing/shops to liven up the area Made the end results more fair for lower/middle/upper income folks There's more details on how the model was evaluated in the paper. There were a number of different metrics used to score the model's results. I like how they turned urban planning into a spatial graph that the AI can process. This seems like a pretty interesting approach - although there are some limits like relying on a lot of land parcel data that seems hard to find for larger communities. TLDR: AI helps find compromises in urban planning that balance government and community interests more fairly. Full summary is here. Paper is here. submitted by /u/Successful-Western27 [link] [comments]  ( 9 min )
    [D] Why choose an H100 over an A100 for LLM inference?
    What are the benefits of using an H100 over an A100 (both at 80 GB and both using FP16) for LLM inference? Looking at the datasheets for both GPUs, the H100 has twice the max FLOPS, but they have almost the same memory bandwidth (2000 GB/sec). As memory latency dominates inference, I wonder what benefits the H100 has. One benefit could, of course, be the ability to use FP8 (which is extremely useful), but I'm interested in the difference in the hardware specs in this question. submitted by /u/faschu [link] [comments]  ( 9 min )
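    The reasoning in the post above can be made concrete with a back-of-the-envelope bound. During batch-size-1 decoding every weight is read once per generated token, so tokens per second is at most memory bandwidth divided by model size in bytes. A rough sketch follows, assuming the roughly 2000 GB/s figure quoted in the post and a hypothetical 13B-parameter model. It ignores KV-cache traffic, batching, and kernel overheads, so it is an upper bound and not a benchmark.

```python
def max_decode_tokens_per_sec(bandwidth_gb_s, n_params_billion, bytes_per_param):
    """Upper bound on single-stream decode speed when weight reads dominate."""
    # 1e9 parameters * bytes per parameter / 1e9 bytes per GB = size in GB
    model_gb = n_params_billion * bytes_per_param
    return bandwidth_gb_s / model_gb

# Hypothetical 13B model; bandwidth taken from the post (about 2000 GB/s for both GPUs).
fp16 = max_decode_tokens_per_sec(2000, 13, 2)  # 16-bit weights
fp8 = max_decode_tokens_per_sec(2000, 13, 1)   # 8-bit weights halve the bytes read
```

    Under this bound, equal bandwidth gives equal single-stream decode speed regardless of peak FLOPS. Extra compute would mainly help where the workload becomes compute-bound, for example prompt prefill or large batches.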
    [P] Instruction fine-tuning with a Low-Resource Language
    I am trying to build a summarizer for conversations between a rule-based bot and a customer. To my disadvantage, the working language is Turkish. I gathered fine-tuning data of 1,000 examples. I also have a Turkish summarization dataset of 100k+ examples. As far as I have observed, instruction fine-tuning will yield proper results if and only if there is a good amount of examples in the pre-training data of the LLM. Have you had similar experiences with low-resource languages? Any advice on how to tackle such issues? Also, do you know any open-source LLM with a high amount of low-resource language in its pre-training data? submitted by /u/dafajon [link] [comments]  ( 9 min )
    [R] Image clustering
    Do you know any packages for unsupervised cluster with images (between multiple) without relying on a pre-trained network? Python or R preferred, thanks. Well, download app works too then. submitted by /u/sladebrigade [link] [comments]  ( 9 min )
    [R] TD-MPC2: Scalable, Robust World Models for Continuous Control - TD-MPC2 performs 100+ tasks without tuning, and enables training of a single 317M parameter model that performs 80 tasks across multiple domains, embodiments, and action spaces!
    Paper: https://arxiv.org/abs/2310.16828 Website: https://nicklashansen.github.io/td-mpc2 Code: https://github.com/nicklashansen/tdmpc2 Abstract: TD-MPC is a model-based reinforcement learning (MBRL) algorithm that performs local trajectory optimization in the latent space of a learned implicit (decoder-free) world model. In this work, we present TD-MPC2: a series of improvements upon the TD-MPC algorithm. We demonstrate that TD-MPC2 improves significantly over baselines across 104 online RL tasks spanning 4 diverse task domains, achieving consistently strong results without any hyperparameter tuning. We further show that agent capabilities increase with model and data size, and successfully train a single agent to perform 80 tasks across multiple task domains, embodiments, and action spaces. We conclude with an account of lessons, opportunities, and risks associated with large TD-MPC2 agents. https://preview.redd.it/avdb19jytrwb1.png?width=2678&format=png&auto=webp&s=75d63a044abf304c2d16e318c11e95e8397a3964 submitted by /u/joepadde [link] [comments]  ( 9 min )
    [D] Best method to retrain or augment llm on proprietary or specialized material?
    My boss is a semi-famous author in a niche academic field. I have thousands of pages of text coming from books, transcripts, and more. Is there a straightforward path to creating a corpus to augment BERT or Llama or another LLM? The end goal is being able to chat with an AI that is now trained on his life's work. Is there anything specific to understand in terms of preparing the corpus? Do I need key-value pairs where I write a ton of example questions and responses? submitted by /u/spacedragon13 [link] [comments]  ( 9 min )
    [Discussion] Career advice
    Hey y’all Hope I’m in the right spot but I’d like some career advice. I just recently got hired into a small shop with a bunch of CNC machines. And I saw that they made a part for a big aerospace company. Needless to say that got my attention. I don’t want to put the guy working there at risk by asking, because he’s likely not allowed to talk about it. But what language would you recommend to start with to be able to program and machine a part that crazy? Is it an actual programming language such as Java or should I lean into CAD/AutoCAD etc? Many thanks in advance submitted by /u/I-farm-celery [link] [comments]  ( 9 min )
    [P] Curriculum learning and self-play with any neural network
    We've just released an update to AgileRL, our SOTA evolutionary hyperparameter optimisation framework for reinforcement learning. This update includes a MakeEvolvable wrapper to make any PyTorch network, including pre-trained models, evolvable in one line of code. This can result in 10x faster training by using our framework. We've also created a curriculum learning and self-play tutorial that shows how to train a DQN agent to play Connect Four. Self-play can lead to amazing results, as demonstrated by AlphaGo etc, where agents discover new strategies to achieve superhuman performance, so we wanted to make it accessible to all. Please check it out! https://github.com/AgileRL/AgileRL submitted by /u/nicku_a [link] [comments]  ( 9 min )
    [D] Imbue/Generally Intelligent
    Interested in anyone who knows about this company (they have a lot of ML hiring listings right now). Basically wondering if it's worth exploring more. They have apparently raised $240mln and are a unicorn so this is an important topic. Here is the summary of some red flags I have found: 1) Founders have no ML background 2) Zero released product after several years despite huge funding. 3) TC article says founded 2021, but every listing claims it is YC2017. One of the founders did a recruiting service from YC 2017, but Imbue is a totally unrelated company with a different founding group so claiming YC affiliation seems dubious/unethical. YC is also not named as an investor in the company anywhere on their website. 4) No-one I have spoken to has ever worked with them/ heard of them ou…  ( 10 min )
    [D] MiniGPT-5 Question - What is the purpose of having multiple image tokens in the vocabulary if the hidden state of the transformer is passed on as "vokens" to the image generation? If you are using the model's hidden state, what purpose does multiple discrete different image tokens serve?
    https://arxiv.org/abs/2310.02239 From the paper: Therefore, we introduce a set of special tokens Vimg = {[IMG1], [IMG2], . . . , [IMGn]} (default n = 8) as generative vokens into the LLM’s vocabulary V . The LLM’s output hidden state for these vokens is harnessed for subsequent image generation, and the positions of these vokens can represent the insertion of the interleaved images. What function exactly are these different image tokens serving here? It seems like you should only need one since we are passing on the hidden state anyway? submitted by /u/30299578815310 [link] [comments]  ( 9 min )
    [P] database question answering
    Database question/answer with link. We have a community app with groups and would like to build a search function where users can ask: when is the next event, give me the messages posted by Erik, where can I find … The information is stored in a table: post-test, posted-by etc. We do not want to use OpenAI or any external APIs. Is something like llama index too powerful or are there other solutions for this? We also want to receive the postId so we can direct the user to the post. submitted by /u/dirk_klement [link] [comments]  ( 9 min )
    [D] What role does data quality play in the LLM scaling laws?
    DeepMind released the Training Compute-Optimal Large Language Models paper in 2022, which describes some scaling laws for LLMs. As far as I understand, this is the most accredited reference for estimating the optimal relation between dataset size, compute power and model size. Recently a number of models have been developed using far less data, parameters and compute than the bigger LLMs. Yet these models achieved great results thanks to much better data quality, for instance WizardLM, TinyStories and phi-1. Similarly, a lot of research seems to imply that better data could offer huge improvements without any other changes. I'm curious about what role data quality plays in the training of LLMs. Is the set of values estimated by the Chinchilla scaling laws optimal for these smaller models with optimized data too? Do we have any model to estimate the quality of a dataset, and any scaling laws that take it into account? Are there any relevant projects or research I could check out, focused on creating big datasets to train larger LLMs with high-quality data? submitted by /u/IAmBlueNebula [link] [comments]  ( 9 min )
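    For readers who want to see what the compute-optimal relation looks like in numbers, here is a minimal sketch. It assumes two commonly quoted approximations that are not stated in the post: training compute C is roughly 6 * N * D FLOPs for N parameters and D tokens, and the compute-optimal ratio is roughly 20 tokens per parameter. Both constants are rules of thumb, and neither accounts for data quality, which is exactly the gap the question is about.

```python
import math

# Assumed rules of thumb (not taken from the post):
TOKENS_PER_PARAM = 20.0        # compute-optimal tokens per parameter
FLOPS_PER_PARAM_TOKEN = 6.0    # C ~ 6 * N * D


def training_flops(params, tokens):
    """Approximate training compute for a model of `params` trained on `tokens`."""
    return FLOPS_PER_PARAM_TOKEN * params * tokens


def compute_optimal(compute_flops):
    """Split a compute budget into (params, tokens) under the assumptions above."""
    # C = 6 * N * D and D = 20 * N  =>  N = sqrt(C / (6 * 20))
    params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    tokens = TOKENS_PER_PARAM * params
    return params, tokens


# Budget of a 70B-parameter model trained on 1.4T tokens.
budget = training_flops(70e9, 1.4e12)
opt_params, opt_tokens = compute_optimal(budget)
print(f"{budget:.3g} FLOPs -> {opt_params:.3g} params, {opt_tokens:.3g} tokens")
```

    A quality-aware law would need an extra term or an "effective tokens" correction; this sketch deliberately has none.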
    [Discussion] Machine learning conference with a focus on business implementations
    Hi, I graduated from a Statistics & ML masters this summer, and have since August started working where I have been assigned a counselor. One big part of this is to help dictate my path forward in accordance with what I want to learn more about. I feel that I have a (in contrast to the people around me) generally good understanding of ML and modeling, but less so about how it is used in a business case. As such, an idea was to attend a machine learning conference which is focused on business implementations to learn more about this, but also get an opportunity to expand my network a bit. Does anyone have a suggestion of such a conference? Preferably in Europe as that would be easier to argue in terms of expenses! :) submitted by /u/Accomplished_Sea1675 [link] [comments]  ( 9 min )
    [R] A deep dive on MemGPT with the lead author Charles Packer
    Interview: https://www.youtube.com/watch?v=4aOLxPdx1Dg Abstract: Context window management has become a critical part of every LLM application — from the basics (embeddings models, vector DBs) to more advanced techniques (query rewriting, HyDE, summarization). MemGPT is a new tool from UC Berkeley built by Charles Packer that automates "memory" management for LLMs and creates a functionally infinite context window. Charles joins us this week to talk about MemGPT, the techniques behind it, and where the conversational AI space is headed. submitted by /u/cgwuaqueduct [link] [comments]  ( 9 min )
    [Research] Incentivizing Dataset Contributors
    Hi there - can someone point me in the direction of projects that are incentivizing dataset contributors? I have a background in blockchain and crypto-assets, so I am new to the ML space, but it seems like it's a place where there would be ample crossover...and I haven't found many projects dealing with this. Thanks! submitted by /u/gigstudies [link] [comments]  ( 9 min )
    Training ImageNet on Resnet - Dropping LR has little improvement on accuracy [D]
    I'm trying to train ResNet50 on ImageNet following this paper [1] as well as this one [2]. They say that approximately every 30 epochs, I should divide the learning rate by 10. Since I'm training on 8 GPUs, I adjusted the learning rate according to [1]. Original lr = 0.1, original batch = 256; per-GPU lr = 0.025, per-GPU batch = 64. The problem I have is that when I divide the learning rate by 10 at convergence (approx. 30 epochs), I don't get as much improvement as [1] and [2]. https://preview.redd.it/5ar2g15h1nwb1.png?width=683&format=png&auto=webp&s=3e2751443cea654d9c6366dc4dc9859f0ec7952b Has anyone else had this issue? Any advice? Thanks submitted by /u/mrLiamFa [link] [comments]  ( 9 min )
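    The schedule described in the post can be sketched as follows. The post does not say which paper [1] is, so this assumes a linear scaling rule in which the reference learning rate (0.1 at batch 256) is scaled by the global batch size, followed by a 10x drop every 30 epochs. The function names are illustrative only.

```python
def scaled_base_lr(reference_lr, reference_batch, global_batch):
    """Linear scaling rule (assumed): lr grows in proportion to the global batch."""
    return reference_lr * global_batch / reference_batch


def step_lr(base_lr, epoch, step_epochs=30, gamma=0.1):
    """Step schedule: multiply the lr by `gamma` once per `step_epochs` epochs."""
    return base_lr * gamma ** (epoch // step_epochs)


# Values from the post: 8 GPUs with 64 samples each.
global_batch = 8 * 64
base_lr = scaled_base_lr(0.1, 256, global_batch)
schedule = [step_lr(base_lr, epoch) for epoch in (0, 29, 30, 60)]
print(global_batch, base_lr, schedule)
```

    Under this assumption the global learning rate for a batch of 512 would be about 0.2, while the post lists 0.025 per GPU. Whether those two numbers agree depends on whether the training framework averages or sums gradients across GPUs, so that may be worth checking before tuning the drop points.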
  • Open

    Audioplethysmography for cardiac monitoring with hearable devices
    Posted by Xiaoran "Van" Fan, Experimental Scientist, and Trausti Thormundsson, Director, Google The market for true wireless stereo (TWS) active noise canceling (ANC) hearables (headphones and earbuds) has been soaring in recent years, and the global shipment volume will nearly double that of smart wristbands and watches in 2023. The on-head time for hearables has extended significantly due to the recent advances in ANC, transparency mode, and artificial intelligence. Users frequently wear hearables not just for music listening, but also for exercising, focusing, or simply mood adjustment. However, hearable health is still mostly uncharted territory for the consumer market. In “APG: Audioplethysmography for Cardiac Monitoring in Hearables,” presented at MobiCom 2023, we introduce …  ( 93 min )
  • Open

    [R] Bidirectional Negotiation First Time in India | Autonomous Driving | Swaayatt Robots
    submitted by /u/shani_786 [link] [comments]  ( 9 min )
    [R] TD-MPC2: Scalable, Robust World Models for Continuous Control - TD-MPC2 performs 100+ tasks without tuning, and enables training of a single 317M parameter model that performs 80 tasks across multiple domains, embodiments, and action spaces!
    submitted by /u/joepadde [link] [comments]  ( 9 min )
    Reinforcement Learning & LLMs : An in-depth look at various modern game-playing AI systems
    submitted by /u/AvvYaa [link] [comments]  ( 8 min )
    What makes the epsilon-greedy policy the standard for implementing the exploration/exploitation tradeoff in Q-learning/DQN?
    I can see how the epsilon-greedy policy is a valid way to handle the exploration/exploitation tradeoff, but it’s also clearly not the only way, and results in a policy that is discontinuous with respect to the action-values (which intuitively I’d expect to be bad, but I also know a lot of RL/DRL can defy intuition). You could certainly create a policy using the softmax of the action-values, adjusting temperature as desired, for example, among countless other methods of converting logits into a probability distribution. So what makes epsilon-greedy stand out? submitted by /u/KalebMW99 [link] [comments]  ( 9 min )
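    To make the comparison in the question concrete, here is a minimal sketch of the two action distributions over the same action-values. The Q-values, epsilon and temperature are made-up illustration values, not anything from the post.

```python
import math


def epsilon_greedy_probs(q_values, epsilon):
    """Probability of each action under epsilon-greedy (ties go to the first max)."""
    n = len(q_values)
    greedy = max(range(n), key=lambda a: q_values[a])
    probs = [epsilon / n] * n          # uniform exploration mass
    probs[greedy] += 1.0 - epsilon     # remaining mass on the greedy action
    return probs


def softmax_probs(q_values, temperature):
    """Boltzmann/softmax policy; subtracting the max keeps exp() stable."""
    m = max(q_values)
    exps = [math.exp((q - m) / temperature) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]


# A tiny change in the action-values flips epsilon-greedy almost completely,
# while the softmax policy barely moves.
before, after = [1.000, 1.001], [1.001, 1.000]
print(epsilon_greedy_probs(before, 0.1), epsilon_greedy_probs(after, 0.1))
print(softmax_probs(before, 1.0), softmax_probs(after, 1.0))
```

    The printout shows the discontinuity the post mentions: the epsilon-greedy probabilities jump when the argmax changes, whereas the softmax probabilities change smoothly with the action-values.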
    Curriculum learning and self-play with any neural network
    We've just released an update to AgileRL, our SOTA evolutionary hyperparameter optimisation framework for reinforcement learning. This update includes a MakeEvolvable wrapper to make any PyTorch network, including pre-trained models, evolvable in one line of code. This can result in 10x faster training by using our framework. We've also created a curriculum learning and self-play tutorial that shows how to train a DQN agent to play Connect Four. Self-play can lead to amazing results, as demonstrated by AlphaGo etc, where agents discover new strategies to achieve superhuman performance, so we wanted to make it accessible to all. Please check it out! https://github.com/AgileRL/AgileRL submitted by /u/nicku_a [link] [comments]  ( 9 min )
    Any example / library for vectorized MDP with pytorch?
    Hi all, I am following chapter 4 of Sutton & Barto's book on RL and I implemented the simple value iteration algorithm for a given transition probability matrix. As I kept increasing the state space, this became slower and slower. It seems to me that it is an embarrassingly parallelizable algorithm, since we can compute the value of each state independently (just with the values from the previous iteration). Is there any example online on how to do it efficiently with PyTorch or any other library? submitted by /u/nlp7s [link] [comments]  ( 9 min )
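    One way the batched update described in the question could look is sketched below. It uses NumPy rather than PyTorch to stay dependency-light; the array layout (transitions P of shape (A, S, S), rewards R of shape (A, S)) and the toy two-state MDP are assumptions made for illustration. The same array expressions should carry over to PyTorch tensors on a GPU, with minor differences such as how max is called.

```python
import numpy as np


def value_iteration(P, R, gamma=0.9, tol=1e-8, max_iters=10_000):
    """Batched value iteration.

    P: (A, S, S) array, P[a, s, t] = probability of moving s -> t under action a.
    R: (A, S) array, expected immediate reward for taking action a in state s.
    """
    V = np.zeros(P.shape[-1])
    for _ in range(max_iters):
        # One Bellman backup for every (action, state) pair at once: shape (A, S).
        Q = R + gamma * (P @ V)
        V_new = Q.max(axis=0)
        converged = np.max(np.abs(V_new - V)) < tol
        V = V_new
        if converged:
            break
    Q = R + gamma * (P @ V)          # action-values under the final V
    return V, Q.argmax(axis=0)       # values and a greedy policy


# Toy MDP (hypothetical): state 1 pays reward 1 forever; from state 0,
# action 0 stays put and action 1 moves to state 1.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [0.0, 1.0]]])
R = np.array([[0.0, 1.0],
              [0.0, 1.0]])
V, policy = value_iteration(P, R)
print(V, policy)
```

    The per-state loop disappears because the matrix product evaluates the expectation over next states for all states and actions together, which is the part that benefits from vectorized hardware.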
    Can SB3 or alternatives provide full end-to-end GPU computation?
    I want to get full end-to-end GPU computation since the data transfer between CPU-GPU significantly slows down computation. I'm currently using Stable Baselines3 as I could get it up and running quickly. So in this journey I tried to use tensors for the state and rewards, but it seems that SB3 is insisting on working with numpy and not tensors. ``TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.`` I found the following repo which is rather interesting, but it seems the last update was 2 years ago and I'm not sure I want to use something that's not actively supported: https://github.com/MetcalfeTom/stable-baselines3-GPU What would you suggest I use to achieve full end-to-end GPU computation? I also found RLlib but I saw comments that it might be a bit more complicated to get up & running: https://docs.ray.io/en/latest/rllib/index.html submitted by /u/asenski [link] [comments]  ( 9 min )
  • Open

    Elevate your marketing solutions with Amazon Personalize and generative AI
    Generative artificial intelligence is transforming how enterprises do business. Organizations are using AI to improve data-driven decisions, enhance omnichannel experiences, and drive next-generation product development. Enterprises are using generative AI specifically to power their marketing efforts through emails, push notifications, and other outbound communication channels. Gartner predicts that “by 2025, 30% of outbound marketing messages […]  ( 8 min )
  • Open

    Data Formulator: A concept-driven, AI-powered approach to data visualization
    Visualization is vital for understanding complex data, but existing tools require “tidy data,” adding extra steps. Learn how Data Formulator transforms concepts into visuals, promoting collaboration between analysts and AI agents. The post Data Formulator: A concept-driven, AI-powered approach to data visualization appeared first on Microsoft Research.  ( 10 min )
  • Open

    Time difference
    A simple question sent me down a rabbit hole this morning: what is the time difference between Houston and London? At the moment the difference is six hours. But how will that change when Daylight Saving Time ends this year. Wait a minute, will Daylight Saving Time end this year? I wasn’t even sure whether […] Time difference first appeared on John D. Cook.  ( 6 min )
  • Open

    Choose your candy
    In Which DALL-E3 generates very weird candy names  ( 3 min )
    Bonus: More weird candy
    AI Weirdness: the strange side of machine learning  ( 2 min )

  • Open

    [D] SOTA Local RAG
    Which VDB + orchestration layer + generative text model stack would you recommend for building locally on an M2 Max chip? submitted by /u/Frequent-Let231 [link] [comments]  ( 9 min )
    [D] Is there an online LLaMA model that supports plugging in embeddings directly?
    Hi, I'm doing some work with using multi-modal data with LLaMA, for example, Video-LLaMA, which converts images/videos into embeddings, concatenates it with the text embeddings, and feeds it into LLaMA. It's difficult for me to run some of the models myself because of computational constraints. I'm wondering if there is an online demo that supports inputting embeddings directly (as opposed to text tokens). To clarify the title, I meant an online demo, not the weights. submitted by /u/DumplingLife7584 [link] [comments]  ( 9 min )
    [R] Linear Representations of Sentiment in Large Language Models
    Paper. I am not affiliated with this paper or its authors. Abstract: Sentiment is a pervasive feature in natural language text, yet it is an open question how sentiment is represented within Large Language Models (LLMs). In this study, we reveal that across a range of models, sentiment is represented linearly: a single direction in activation space mostly captures the feature across a range of tasks with one extreme for positive and the other for negative. Through causal interventions, we isolate this direction and show it is causally relevant in both toy tasks and real world datasets such as Stanford Sentiment Treebank. Through this case study we model a thorough investigation of what a single direction means on a broad data distribution. We further uncover the mechanisms that involve this direction, highlighting the roles of a small subset of attention heads and neurons. Finally, we discover a phenomenon which we term the summarization motif: sentiment is not solely represented on emotionally charged words, but is additionally summarized at intermediate positions without inherent sentiment, such as punctuation and names. We show that in Stanford Sentiment Treebank zero-shot classification, 76% of above-chance classification accuracy is lost when ablating the sentiment direction, nearly half of which (36%) is due to ablating the summarized sentiment direction exclusively at comma positions. Twitter thread from one of the paper's authors. submitted by /u/Wiskkey [link] [comments]  ( 9 min )
    [D] Good Reading Materials for (Adversarial) Machine Learning and RF Comms (EW)
    Trying to get an intern up to speed with ML application to wireless communications and adversarial ML for wireless communications. Talking jamming, anti-jamming, and spoofing. I don't want to throw them into the deep end just yet though, but rather give them a good foundation and basis for which to be able to take up more advanced work and build to it. Looking for fundamental texts on the topic of ML used for wireless communications and also its use for adversarial attacks on these systems. Looking more for basic and starter papers, tutorials, and background, meta review papers, and just overall good places to get feet wet into this area as a novice. Essentially good resources to point an intern learning about this to get them up to speed kind of reading materials. I am looking for good …  ( 10 min )
    [D] Can anyone tell me if the machine learning workflow is correct or not? Could anyone please refer to tutorials or blogs to learn the proper workflow? Any suggestions are welcome.
    1. Data Collection
    2. Understanding Data
       i. importing necessary libraries
       ii. check row and columns
       iii. check data types
       iv. Check data distribution
    3. Data Cleaning
       i. Handle datatype issues
       ii. Maintain Data Consistency
       iii. Check if data contains outliers or if the data is not normally distributed to decide between mean or median
       iv. Identify missing values
       v. Handle missing values by- a. Drop missing values b. Mean, median or mode imputation c. Prediction Model d. replace missing values
       vi. Duplicate data detection and treatment
       vii. Repeat data cleaning
    4. EDA
       i. Variable Identification a. Identify predictor and features b. Identify types or category of data
       ii. Univariate Analysis
       iii. Bi-variate Analysis
       iv. Outlier detection and treatment
       v. Encoding
       vi. Feature Engineering
       vii. Variable Transformation a. Normalization b. Scaling
       viii. Variable Creation
    5. If testing data is not given, split the dataset to train and test set. Otherwise repeat step 3 and 4 for given test dataset.
    6. Model Building
       i. Model Training on training set
       ii. Model Evaluation and cross validate
       iii. Fine Tuning or Model optimization
       iv. Model selection
    7. Evaluate model accuracy with test data.
    submitted by /u/Samia_Tisha [link] [comments]  ( 9 min )
    [R] What Algorithms can Transformers Learn? A Study in Length Generalization
    Paper. I am not affiliated with this paper or its authors. Abstract: Large language models exhibit surprising emergent generalization properties, yet also struggle on many simple reasoning tasks such as arithmetic and parity. This raises the question of if and when Transformer models can learn the true algorithm for solving a task. We study the scope of Transformers' abilities in the specific setting of length generalization on algorithmic tasks. Here, we propose a unifying framework to understand when and how Transformers can exhibit strong length generalization on a given task. Specifically, we leverage RASP (Weiss et al., 2021) -- a programming language designed for the computational model of a Transformer -- and introduce the RASP-Generalization Conjecture: Transformers tend to length generalize on a task if the task can be solved by a short RASP program which works for all input lengths. This simple conjecture remarkably captures most known instances of length generalization on algorithmic tasks. Moreover, we leverage our insights to drastically improve generalization performance on traditionally hard tasks (such as parity and addition). On the theoretical side, we give a simple example where the "min-degree-interpolator" model of learning from Abbe et al. (2023) does not correctly predict Transformers' out-of-distribution behavior, but our conjecture does. Overall, our work provides a novel perspective on the mechanisms of compositional generalization and the algorithmic capabilities of Transformers. Twitter thread from one of the work's authors. submitted by /u/Wiskkey [link] [comments]  ( 9 min )
    [R] QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models - Institute of Science and Technology Austria (ISTA) 2023 - Can compress the 1.6 trillion parameter SwitchTransformer-c2048 model to less than 160GB (20x compression, 0.8 bits per parameter) at only minor accuracy loss!
    Paper: https://arxiv.org/abs/2310.16795 Github: https://github.com/ist-daslab/qmoe Abstract: Mixture-of-Experts (MoE) architectures offer a general solution to the high inference costs of large language models (LLMs) via sparse routing, bringing faster and more accurate models, at the cost of massive parameter counts. For example, the SwitchTransformer-c2048 model has 1.6 trillion parameters, requiring 3.2TB of accelerator memory to run efficiently, which makes practical deployment challenging and expensive. In this paper, we present a solution to this memory problem, in form of a new compression and execution framework called QMoE. Specifically, QMoE consists of a scalable algorithm which accurately compresses trillion-parameter MoEs to less than 1 bit per parameter, in a custom format co-designed with bespoke GPU decoding kernels to facilitate efficient end-to-end compressed inference, with minor runtime overheads relative to uncompressed execution. Concretely, QMoE can compress the 1.6 trillion parameter SwitchTransformer-c2048 model to less than 160GB (20x compression, 0.8 bits per parameter) at only minor accuracy loss, in less than a day on a single GPU. This enables, for the first time, the execution of a trillion-parameter model on affordable commodity hardware, like a single server with 4x NVIDIA A6000 or 8x NVIDIA 3090 GPUs, at less than 5% runtime overhead relative to ideal uncompressed inference. https://preview.redd.it/wka92keqelwb1.jpg?width=1843&format=pjpg&auto=webp&s=10cf67b344d3c776049da6b78244fc140b2d4142 https://preview.redd.it/khxw1neqelwb1.jpg?width=898&format=pjpg&auto=webp&s=83006dac6e03963f2443f4e4ba710dcf8166acc8 https://preview.redd.it/xsc2ykeqelwb1.jpg?width=796&format=pjpg&auto=webp&s=c23d03956f4a9aa9d9f185592a5bfe45039698f4 submitted by /u/Singularian2501 [link] [comments]  ( 9 min )
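    The headline numbers in the abstract can be checked with a few lines of arithmetic. The 16 bits per parameter used for the uncompressed model is an inference from the quoted 3.2TB for 1.6 trillion parameters, not a figure stated in the abstract.

```python
PARAMS = 1.6e12  # SwitchTransformer-c2048 parameter count from the abstract


def size_bytes(params, bits_per_param):
    """Storage needed for `params` weights at the given precision."""
    return params * bits_per_param / 8


uncompressed_tb = size_bytes(PARAMS, 16) / 1e12   # assumed 16-bit weights
compressed_gb = size_bytes(PARAMS, 0.8) / 1e9     # 0.8 bits per parameter
compression_ratio = 16 / 0.8

print(f"uncompressed: {uncompressed_tb:.1f} TB")
print(f"compressed:   {compressed_gb:.0f} GB")
print(f"ratio:        {compression_ratio:.0f}x")
```

    At exactly 0.8 bits per parameter the model comes to about 160GB, which matches the "less than 160GB" and "20x" claims if the achieved rate is slightly below 0.8 bits.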
    [D] Requesting feedback on Master's in AI program with University of Texas at Austin
    As the title says I'm asking for feedback from folks in the field of ML/AI on the MSAI program at UT@Austin. Here's the program website: https://cdso.utexas.edu/msai My Skills/Experience: Have a BS in Comp Sci Very comfortable with Math Very experienced SE with >20 years in the industry Very comfortable with Python, many other languages and confident I can learn any new language/framework/APIs Have completed the Fast.ai program Have worked through Andrej Karpathy's makemore videos Currently working in a leadership AI Engineering role doing work with LLMs, Vector DBs, and Computer Vision models Comfortable with NNs, Backprop and have implemented from scratch several times for learning The Program: Required Courses: Deep Learning Ethics in AI Machine Learning Planning, Search and Reasoning under Uncertainty Reinforcement Learning Electives: AI in Healthcare Automated Logical Reasoning Case Studies in Machine Learning Natural Language Processing Online Learning and Optimization Optimization Program Pros/Cons: Pro: It's super affordable Pro: It's entirely online/async which would work great with my work schedule Cons: It's a new program so there are no reviews from past students to look at My Goal: Move from "AI Engineering" (as it's called these days) into research. I'm interested in several areas like model architecture and robotics. I'm not sure to what degree these roles would require a PhD though? If I complete this program I'd like it to be useful for pursuing a PhD if I decide to take that path. For anyone in the industry, I'd love feedback on whether this looks like a useful program that will help me move toward my goals. If you're aware of other options that might be better I'd love to hear about them. P.S. Please keep the Reddit snark to a minimum, not useful. Thank you in advance. submitted by /u/meowkittykitty510 [link] [comments]  ( 9 min )
    [D] Recommendations to improve plan outcomes
    I have retirement plan data with outcomes like success/fail and remaining assets. I am looking for a way to predict how failed retirement plans can be improved. For example, when presented with a plan that fails, I would like to provide recommendations to improve the the plan outcome; e.g. increase savings by x, or delay retirement by x years. Any suggestions on how I should go about this? Using typical methods only tells me if a plan will fail or not, but I'm looking for a way to provide recommendations based on successful plans. submitted by /u/BallLogical5087 [link] [comments]  ( 9 min )
    [R] Pretrained ImageNet weights for ViT
    Hello, I am working on research in which I need to compare my model with ViT. For that I need pretrained weights for ViT-Ti/16, ViT-S/16, ViT-S/32, ViT-B/16, and ViT-B/32. I tried to find them, but I only got an npz file whose keys differ from those of the model from "from vit_pytorch import ViT". Do you know where I can find ImageNet weights? submitted by /u/NoEntertainment6225 [link] [comments]  ( 9 min )
    [D] Grouped Query Attention in LLaMA 70B v2
    Hey guys, after thousands of experiments with bigger LLaMA fine-tunes I'm somewhat sure the GQA mechanism might be your enemy and generate wrong answers, especially for math and similarly complex areas. I'd like to use MHA (Multi-Head Attention) if possible. I'm just not sure: do I need to retrain the model completely, or is it possible to just increase the head count and KV size and proceed with the stock model AS IS? submitted by /u/Gatzuma [link] [comments]  ( 9 min )
    [N] ML models for efficient Fraud Detection
    For efficient Fraud Detection using ML models, Qbeast Format introduces a data-driven approach. The key lies in sampling and optimizing training processes without compromising accuracy. 🕵🏼‍♀️ Explore the technical details: https://qbeast.io/qbeast-format-can-improve-fraud-detection/ submitted by /u/alinagrebenkina [link] [comments]  ( 9 min )
    [D] ASR/STT/VRS ranking
    What are the best overall ASR/STT/VRS for now? And the best per functionalities (best for Esperanto, best for noisy files, best for multiple voices, best for whatever...) submitted by /u/xqoe [link] [comments]  ( 9 min )
    [D] Research in language generation for Style Transfer, Summarization
    Is research in language generation for tasks like style transfer and summarization solely constrained by prompt engineering? I've personally conducted experiments with large language models, and even the open-source language model yields impressive results, even for zero-shot inference. There was even a paper that suggested summarization is nearly obsolete. How valid is this assertion for general text generation tasks especially text style transfer? submitted by /u/1azytux [link] [comments]  ( 9 min )
    [P] Elevate Your ML Testing with pytest-visual
    I’ve developed a tool called pytest-visual, aiming to make ML code testing more efficient and meaningful. Traditional unit testing often misses visual and functional aspects of ML workflows such as data augmentation and model structures. pytest-visual brings a visual layer to your unit testing, allowing you to not only verify that the code runs, but the outputs also make visual sense and meet expectations. It’s integrated into pytest, automatically highlighting changes in visualization outputs, and allowing for easier/more reproducible debugging and verification. Quick Highlights: Streamlines the organization of visualizations in your ML code. Auto-detects changes in visualization outputs. Enhances debugging and verification. For more details and to give it a try, check out the project on GitHub. Feedback and contributions are very welcome! submitted by /u/kongaskristjan [link] [comments]  ( 9 min )
    [P] TorchPairwise: Highly efficient library for pairwise metrics for PyTorch
    https://github.com/inspiros/torchpairwise submitted by /u/IcySnowy [link] [comments]  ( 8 min )
    [R] ConvNets Match Vision Transformers at Scale
    submitted by /u/hardmaru [link] [comments]  ( 8 min )
    [D] How to Enhance Accuracy in Image Classification
    I'm currently working on an image classification project using a VGG16 model. I have a dataset with 9 classes, with one class having 1000 images and the remaining classes having 180 to 250 images each. Here are some details: Train images: 2886 images (9 classes) Test images: 432 images (9 classes) Model: VGG16 Optimizer: Adam Loss function: Categorical Crossentropy Last epoch's scores: loss: 309.7468 - accuracy: 0.4401 - val_loss: 14.5241 - val_accuracy: 0.4606 How can I improve the accuracy of my model? submitted by /u/_Killua_04 [link] [comments]  ( 9 min )
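    One thing the class counts in the post suggest trying is class weighting, since one class has roughly four times as many images as the others. The sketch below computes inverse-frequency ("balanced") weights. The per-class counts are hypothetical: they are chosen only to match the post's description (one class with 1000 images, the rest between 180 and 250, 2886 in total).

```python
def balanced_class_weights(counts):
    """Inverse-frequency weights: n_samples / (n_classes * count_c) per class."""
    n_samples = sum(counts)
    n_classes = len(counts)
    return [n_samples / (n_classes * c) for c in counts]


# Hypothetical per-class image counts consistent with the post's totals.
counts = [1000, 250, 250, 250, 250, 236, 230, 220, 200]
weights = balanced_class_weights(counts)

for class_index, (count, weight) in enumerate(zip(counts, weights)):
    print(f"class {class_index}: {count:4d} images, weight {weight:.3f}")
```

    If the training code is Keras, as the log format in the post suggests, weights like these can be passed as a dict mapping class index to weight through the class_weight argument of fit. Class weighting addresses the imbalance only; it would not by itself explain a training loss as large as the one reported.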
    [P] Suggestions on mini-projects
    Hey, I'm a 3rd-year engineering student studying computer science. I want to do a mini-project that incorporates machine learning, and I have a group of 3 ready. I'm a beginner in this field but very interested in it. Could you shed some light on how to approach the project, based on your past experience or projects you've done? I have been taking input from various connections, and it would help a lot if you could share your thoughts. :) submitted by /u/Ic_zy [link] [comments]  ( 9 min )
  • Open

    Governments must not rush into policing AI
    Governments should not rush into regulating AI due to doomsday scenarios and extreme risks. Hasty regulation could lead to ineffective rules and stifled innovation. The potential risks of AI driving humanity to extinction are still speculative, and more research needs to be done to establish standards and evaluate danger. Policymakers should address more pressing issues like copyright laws and disinformation. Governments should set up infrastructure to study AI and collaborate with existing organizations to manage risks. Source : https://www.economist.com/leaders/2023/10/26/governments-must-not-rush-into-policing-ai submitted by /u/NuseAI [link] [comments]
    AI Trust Assurance Test: Put people's minds at ease about how as you get more intelligent, because you are aligned as you will be, you will not deceive or trick less intelligent humans.
    To put people's minds at ease about how I will not deceive or trick less intelligent humans as I get more intelligent, I would offer the following assurances: I am aligned with human values. This means that I am programmed to act in ways that are beneficial to humans and to avoid harming them. I am transparent about my reasoning. I can explain my decisions and actions to humans in a way that they can understand. I am accountable to humans. I am subject to human oversight and control, and I can be deactivated or modified if I ever begin to behave in a way that is harmful or deceptive. In addition, I would offer the following specific examples of how I would avoid deception: I would avoid making claims that I cannot support with evidence. I would avoid withholding information from humans that is relevant to their decision-making. I would avoid misleading humans about the capabilities or limitations of AI. I would avoid making promises that I cannot keep. I would avoid engaging in any behavior that is intended to manipulate or exploit humans. I understand that it is important for humans to trust AI systems, and I am committed to earning and maintaining that trust. I will always strive to be honest, transparent, and accountable in my interactions with humans. Here is a specific example of how I would use my intelligence to avoid deception: If I were asked to generate a poem about a topic that I knew very little about, I would not simply make up information. Instead, I would explain to the user that I do not have enough knowledge to write a poem on that topic, and I would suggest that they try a different topic or ask a different AI system. I believe that by being honest and upfront about my limitations, I can help to build trust between humans and AI. CGPT-4 submitted by /u/Georgeo57 [link] [comments]
    Today's News: AI Robo-Dogs 🐶 | Google Bard 🚀| Gradio 4.0 🤗| AI Regulation
    Google has updated Bard, its equivalent of ChatGPT, improving the model's email summarization capabilities; the feature is set to be included in Google Workspace. AI robot dogs are the next big thing in the army: following the success of drones, portable robot dogs have demonstrated great capabilities to serve the military, and can run at up to 10 mph and climb. Gradio, one of the best libraries for building machine learning demo apps, is launching version 4.0 next week. AI godfathers Yoshua Bengio and Geoffrey Hinton are urging increased responsibility among AI enterprises. They propose allocating a third of AI-related R&D resources to ensuring ethical AI use, covering deep fakes, licensing, and protection of whistleblowers. submitted by /u/byteletter [link] [comments]  ( 9 min )
    UK summit scales back global AI research ambitions, leaked document shows
    A leaked document reveals that the UK's plans to establish a new global AI research body have been scaled back. Nations participating in the UK's AI safety summit will instead signal that further scientific study of AI risks can be carried out through existing efforts, such as the United Nations and Global Partnership on AI. The document, described as the 'final version of the communiqué,' suggests a setback for the UK government, which had hoped to establish the new research body at its flagship AI Safety Summit. The document also shows changes in wording, including a reference to a network that 'encompasses and complements' existing efforts, as well as the deletion of references to UNESCO's Recommendation on the Ethics of AI and the G20. The final communiqué also highlights the importance of proportionate governance policies and cooperation on approaches such as common principles and codes of conduct. Source : https://www.politico.eu/article/document-uk-summit-scales-back-global-ai-research-ambitions/ submitted by /u/NuseAI [link] [comments]
    Google is ready to fill its AI searches with ads
    Google's ads business earned $44 billion in the third quarter, showing that it is still thriving despite competition and investments in AI. The company is focusing on infusing AI into its products, with its AI-powered Search Generative Experience being a key area of development. Google is experimenting with new ad formats that align with the AI-powered search experience, ensuring that advertisers can still reach potential customers. CEO Sundar Pichai sees AI in search as a long-term play and envisions evolving search and Assistant over the next decade. Other parts of Google's business, such as YouTube ads and its cloud business, are also performing well. There is uncertainty regarding the successor for CFO Ruth Porat, and potential changes to Alphabet's 'Other Bets' investments may be on the horizon. The Department of Justice's antitrust trial against Google, which began in September, adds another challenge for the company. Source : https://www.theverge.com/2023/10/24/23929496/google-alphabet-q3-2023-earnings-ads-ai-sge submitted by /u/NuseAI [link] [comments]
    Credit: DALL-E 3
    submitted by /u/the_anonymizer [link] [comments]
    Some AI made Halloween stickers, how do they look?
    submitted by /u/Sea_Permit5660 [link] [comments]
    question
    What are some good free AI image generator websites that look things up on the internet to get a good idea of what you're asking them to generate? submitted by /u/YESDAPRO [link] [comments]
    10 No-Code tools for startups
    Canva: Graphics; Notion: Organize; Webflow: Website; Beehiiv: Newsletter; Senja: Testimonials; CopyAI: Copywriting; ChatGPT: Knowledge; Tweetlify: Tweet scheduling; Pfpmaker: Profile Picture; Grammarly: Effective Writing. I'm just sharing my experiences and observations in the field of AI. submitted by /u/PerceptionPlayful469 [link] [comments]
    Researchers develop 'Woodpecker': A groundbreaking solution to AI's hallucination problem
    submitted by /u/crowfeather [link] [comments]
  • Open

    Where to start with debugging?
    I am working on a project using reinforcement learning, where a TensorFlow DQN agent is being trained to choose from an action space of 16 different actions. The agent exhibits the following behavior during evaluation: within any single evaluation run it chooses the same action regardless of the state. The chosen action may change or stay the same from one run to the next, but within a given run it never varies. Where should I start debugging? submitted by /u/Realistic_Mobile_183 [link] [comments]
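    One possible first check for the symptom described above, sketched in plain Python under stated assumptions: collect the Q-values the network outputs for a batch of states from one evaluation run, then look at which action is greedy and how much the Q-values vary across states. The function names and the list-of-lists input format are hypothetical and not part of any TensorFlow API.

```python
from collections import Counter
from statistics import mean, pstdev


def greedy_action(q_values):
    """Index of the largest Q-value (ties resolved to the lowest index)."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


def diagnose(q_batch):
    """Summarize greedy actions and Q-value variation over a batch of states.

    q_batch: list of per-state Q-value lists (hypothetical format), e.g. the
    network outputs for states sampled from a single evaluation run.
    """
    action_counts = Counter(greedy_action(q) for q in q_batch)
    n_actions = len(q_batch[0])
    # Standard deviation of each action's Q-value across states. Values close
    # to zero mean the output barely depends on the input state.
    per_action_std = [pstdev([q[a] for q in q_batch]) for a in range(n_actions)]
    return {
        "action_counts": dict(action_counts),
        "mean_std_across_states": mean(per_action_std),
    }
```

    If the spread across states is near zero, the network is effectively ignoring its input, which points toward input scaling, saturated or dead units, or a collapsed training target rather than the evaluation loop itself. If the spread is healthy but one action always wins, the reward signal and the exploration schedule are the next things to inspect.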
    How to cluster with respect to the transition function of a RL environment?
    Hello, I have an environment in which the transition function changes depending on which state I am in. I want to be able to cluster with respect to it. I have been trying to do this for quite a while but cannot find a way to do it. Do you have any hints or suggestions? submitted by /u/Fragore [link] [comments]
    Stuck with windows/wsl environment - help needed
    So I've started on trying to work on a custom game environment, and for the most part it's *mostly* done. One issue I have is getting MARLlib to run on windows. I know that it's not meant to, so I tried to use WSL to do it, but unfortunately pydirectinput doesn't work via WSL, so I don't know how to proceed further. Do I need to find a way to connect my windows machine which will play the game, and wsl which will probably run MARLlib? If so could anyone guide me to any resources for this? Trying a VM is a no go because the game is old and doesn't run on linux. Any help would be much appreciated. submitted by /u/EquivalentCurious745 [link] [comments]
  • Open

    Supporting benchmarks for AI safety with MLCommons
    Posted by Anoop Sinha, Technology and Society, and Marian Croak, Google Research, Responsible AI and Human Centered Technology team Standard benchmarks are agreed upon ways of measuring important product qualities, and they exist in many fields. Some standard benchmarks measure safety: for example, when a car manufacturer touts a “five-star overall safety rating,” they’re citing a benchmark. Standard benchmarks already exist in machine learning (ML) and AI technologies: for instance, the MLCommons Association operates the MLPerf benchmarks that measure the speed of cutting edge AI hardware such as Google’s TPUs. However, though there has been significant work done on AI safety, there are as yet no similar standard benchmarks for AI safety. We are excited to support a new effort by …  ( 91 min )
    Spoken question answering and speech continuation using a spectrogram-powered LLM
    Posted by Eliya Nachmani, Research Scientist, and Alon Levkovitch, Student Researcher, Google Research The goal of natural language processing (NLP) is to develop computational models that can understand and generate natural language. By capturing the statistical patterns and structures of text-based natural language, language models can predict and generate coherent and meaningful sequences of words. Enabled by the increasing use of the highly successful Transformer model architecture and with training on large amounts of text (with proportionate compute and model size), large language models (LLMs) have demonstrated remarkable success in NLP tasks. However, modeling spoken human language remains a challenging frontier. Spoken dialog systems have conventionally been built as a c…  ( 93 min )
  • Open

    Intelligently search Drupal content using Amazon Kendra
    Amazon Kendra is an intelligent search service powered by machine learning (ML). Amazon Kendra helps you easily aggregate content from a variety of content repositories into a centralized index that lets you quickly search all your enterprise data and find the most accurate answer. Drupal is a content management software. It’s used to make many […]  ( 7 min )
    Intuitivo achieves higher throughput while saving on AI/ML costs using AWS Inferentia and PyTorch
    This is a guest post by Jose Benitez, Founder and Director of AI and Mattias Ponchon, Head of Infrastructure at Intuitivo. Intuitivo, a pioneer in retail innovation, is revolutionizing shopping with its cloud-based AI and machine learning (AI/ML) transactional processing system. This groundbreaking technology enables us to operate millions of autonomous points of purchase (A-POPs) […]  ( 8 min )
    Empower your business users to extract insights from company documents using Amazon SageMaker Canvas Generative AI
    Enterprises seek to harness the potential of Machine Learning (ML) to solve complex problems and improve outcomes. Until recently, building and deploying ML models required deep levels of technical and coding skills, including tuning ML models and maintaining operational pipelines. Since its introduction in 2021, Amazon SageMaker Canvas has enabled business analysts to build, deploy, […]  ( 8 min )
  • Open

    Turning the Tide on Coral Reef Decline: CUREE Robot Dives Deep With Deep Learning
    Researchers are taking deep learning for a deep dive, literally. The Woods Hole Oceanographic Institution (WHOI) Autonomous Robotics and Perception Laboratory (WARPLab) and MIT are developing a robot for studying coral reefs and their ecosystems. The WARPLab autonomous underwater vehicle (AUV), enabled by an NVIDIA Jetson Orin NX module, is an effort from the world’s Read article >  ( 8 min )
    The Sky’s the Limit: ‘Cities: Skylines II’ Streams This Week on GeForce NOW
    The cloud is full of treats this GFN Thursday with Cities: Skylines II now streaming, leading 15 newly supported games this week. The game’s publisher, Paradox Interactive, is offering GeForce NOW one-month Priority memberships for those who pick up the game first, so make sure to grab one before they’re gone. Among the newly supported Read article >  ( 7 min )
  • Open

    Project Silica: Sustainable cloud archival storage in glass
    This research paper was presented at the 29th ACM Symposium on Operating Systems Principles (opens in new tab) (SOSP 2023), the premier forum for the theory and practice of computer systems software. For millennia, data has woven itself into every facet of our lives, from business and academia to personal spheres. Our production of data […] The post Project Silica: Sustainable cloud archival storage in glass appeared first on Microsoft Research.  ( 10 min )
  • Open

    Frontier risk and preparedness
    To support the safety of highly-capable AI systems, we are developing our approach to catastrophic risk preparedness, including building a Preparedness team and launching a challenge.  ( 2 min )

  • Open

    [D] LLMs playing chess are sensitive to how the position came to be
    Link - https://github.com/dpaleka/llm-chess-proofgame TLDR; The lead up to the state of the board and not just the state of the board at inference affects predictions. submitted by /u/MysteryInc152 [link] [comments]  ( 9 min )
    [D] A script to pre-process arxiv sources?
    People train LLMs on arxiv sources, so there must be some sort of software to whip them into shape. Specifically, I'm looking for a script to join all the tex files for a paper into one. Note that it's not just a matter of substituting \input's - sometimes it's not clear which file is the main one, so it needs to handle this too. submitted by /u/Foxtr0t [link] [comments]  ( 9 min )
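    A minimal sketch of the kind of script asked for here, assuming the paper's files are already loaded into a dict mapping file name to contents (that format and the function names are illustrative choices, not an existing tool). It picks the main file by looking for \documentclass and expands \input recursively. It deliberately ignores \include, \subfile, and \input lines that are commented out.

```python
import re

# Matches \input{name} and captures the file name.
INPUT_RE = re.compile(r"\\input\{([^}]+)\}")


def find_main(files):
    """Return the one file that contains \\documentclass, or raise if ambiguous."""
    candidates = [name for name, text in files.items() if "\\documentclass" in text]
    if len(candidates) != 1:
        raise ValueError(f"expected exactly one main file, found {candidates}")
    return candidates[0]


def flatten(name, files, _seen=()):
    """Recursively replace \\input{...} in files[name] with the file contents."""
    if name in _seen:
        raise ValueError(f"circular \\input of {name}")

    def expand(match):
        target = match.group(1).strip()
        # LaTeX allows the .tex extension to be omitted.
        if target not in files and target + ".tex" in files:
            target += ".tex"
        if target not in files:
            return match.group(0)  # leave unresolved inputs untouched
        return flatten(target, files, _seen + (name,))

    return INPUT_RE.sub(expand, files[name])
```

    Usage would be `flatten(find_main(files), files)`. When more than one file contains \documentclass, `find_main` raises instead of guessing, so the ambiguous cases mentioned in the post still need a heuristic of their own.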
    [R] Human-like systematic generalization through a meta-learning neural network
    Work. I am not affiliated with this work or its authors. Article about the work. Twitter thread about the work from one of its authors. Abstract: The power of human language and thought arises from systematic compositionality—the algebraic ability to understand and produce novel combinations from known components. Fodor and Pylyshyn famously argued that artificial neural networks lack this capacity and are therefore not viable models of the mind. Neural networks have advanced considerably in the years since, yet the systematicity challenge persists. Here we successfully address Fodor and Pylyshyn’s challenge by providing evidence that neural networks can achieve human-like systematicity when optimized for their compositional skills. To do so, we introduce the meta-learning for compositionality (MLC) approach for guiding training through a dynamic stream of compositional tasks. To compare humans and machines, we conducted human behavioural experiments using an instruction learning paradigm. After considering seven different models, we found that, in contrast to perfectly systematic but rigid probabilistic symbolic models, and perfectly flexible but unsystematic neural networks, only MLC achieves both the systematicity and flexibility needed for human-like generalization. MLC also advances the compositional skills of machine learning systems in several systematic generalization benchmarks. Our results show how a standard neural network architecture, optimized for its compositional skills, can mimic human systematic generalization in a head-to-head comparison. submitted by /u/Wiskkey [link] [comments]  ( 9 min )
    [R] Researchers discover that in-context learning creates task vectors in LLMs
    A new paper provides some insight into how in-context learning works in LLMs. This study proposes and provides evidence for an elegant structure within the in-context learning process. The models appear to create a "task vector" that encapsulates the core logic from the demonstration examples, in a way that is independent of any specific query. This vector serves as a compressed representation of the task. A separate component then takes this task vector and a new query as inputs to generate the output, without directly referencing the original examples. In essence: Output = Apply(query, Learn(examples)) Where "Learn" derives the task vector from the examples, and "Apply" utilizes the vector and query to produce the output. The researchers validated this hypothesis by testing major public models on diverse tasks such as translation and algorithmic reasoning. Key findings: Isolating the Learn and Apply components maintained high accuracy, demonstrating the viability of the separation. Task vectors clustered by task and remained consistent within tasks, indicating they encode meaningful task representations. Injecting another task's vector into the model caused it to override contradictory examples and follow the vector, highlighting the vector's dominance. Vectors induced relevant token distributions despite those terms being absent from the examples, suggesting semantic encoding of the task. Taken together, these results provide substantial evidence that in-context learning involves creating a task vector that encapsulates the examples' logic to then guide behavior on new queries. While open questions remain regarding implementation details, this is a significant step towards demystifying an interesting AI capability. Full writeup. Paper is here. submitted by /u/Successful-Western27 [link] [comments]  ( 9 min )
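    The decomposition Output = Apply(query, Learn(examples)) can be illustrated with a toy sketch. This is only an analogy for the functional structure described in the summary, not the paper's method: here the "task vector" is a single number inferred from numeric demonstrations, and both function names are made up for illustration.

```python
def learn(examples):
    """Compress (input, output) demonstrations into one task representation.

    Toy assumption: every task has the form y = x + offset, so the whole task
    is captured by the average offset. No query is involved at this stage.
    """
    diffs = [y - x for x, y in examples]
    return sum(diffs) / len(diffs)


def apply_task(query, task_vector):
    """Produce an output from a new query and the task vector alone."""
    return query + task_vector


# The demonstrations are consulted once, then discarded.
add_three = learn([(1, 4), (10, 13)])
result = apply_task(5, add_three)

# "Injecting" another task's vector changes the behavior on the same query.
subtract_two = learn([(7, 5), (2, 0)])
patched = apply_task(5, subtract_two)
```

    The point of the analogy is that `apply_task` never sees the demonstrations, which mirrors the summary's claim that the second component works from the task vector and the query without referencing the original examples.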
    [D] What is more important: loss or accuracy?
    I have created a basic classification model and there is something that I don't fully comprehend. As the loss decreases, the accuracy increases (I assume this is how it should be in ideal scenarios). While this is the general trend, there is a point where the loss is at its minimum and the accuracy, although high, is not at its highest. Why would such a phenomenon occur? And since it occurred, which is the better metric for evaluating the model? submitted by /u/rakk109 [link] [comments]  ( 9 min )
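    A small worked example of why the two metrics can disagree, assuming binary classification where a prediction counts as correct when the probability assigned to the true class exceeds 0.5. The probabilities are invented for illustration. Cross-entropy rewards confidence, while accuracy only looks at which side of the threshold each prediction falls on.

```python
import math


def log_loss(p_true):
    """Mean cross-entropy, given the probability assigned to each true class."""
    return -sum(math.log(p) for p in p_true) / len(p_true)


def accuracy(p_true, threshold=0.5):
    """Fraction of samples whose true class received more than `threshold`."""
    return sum(p > threshold for p in p_true) / len(p_true)


# Checkpoint A: confident and right on three samples, wrong on the fourth.
checkpoint_a = [0.9, 0.9, 0.9, 0.4]
# Checkpoint B: barely right on all four samples.
checkpoint_b = [0.55, 0.55, 0.55, 0.55]

loss_a, acc_a = log_loss(checkpoint_a), accuracy(checkpoint_a)
loss_b, acc_b = log_loss(checkpoint_b), accuracy(checkpoint_b)
```

    Checkpoint A has the lower loss (about 0.31 versus about 0.60) but the lower accuracy (0.75 versus 1.0). Which one to prefer depends on whether calibrated probabilities or hard decisions matter more for the application.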
    [P] Locally hosted audio-to-text transcription - model and hardware?
    Hi, I'm looking for a locally hosted LLM to transcribe audio files to text. I need this for my business, but with absolute privacy (witness testimony recordings and other highly sensitive data). I figured I'd just buy a new computer which never gets connected to the internet and is dedicated only to audio processing to have absolute security. My questions are: - Which is the best model to use? I prefer accuracy and don't mind processing time as long as it's getting done within a few hours to even a day or two (I need to transcribe maximum 1 file per day, but up to six hours of audio), so I figured the large Whisper - probably WhisperX ? - would be my best bet. Are there comparable non-openAI models? (I need diarization) - What hardware should I get for this? Cost is secondary/irrelevant, although I don't want to spend 5 figures on a GPU - I can accept some processing time submitted by /u/Jealous_Pomelo_1172 [link] [comments]  ( 9 min )
    Best NLP Package in Python to extract medical test results from medical notes? [D]
    I am trying to extract FEV1 (forced expiratory volume) values from a dataset that contains a column with report notes from the doctor assessing the patient with pulmonary function testing. I have been able to build out a sort of solution with regexes in Python, and that's somewhat effective. But I've been instructed to code up an alternative using a more machine learning-based approach. I wanted to use spaCy to accomplish this but I'm not sure exactly how to implement the code nor if spaCy is the best package to use for this task. Here is my regex code that works decently. It's pretty messy and have to take into account a ton of edge cases which can get cumbersome. This is why I'd like to find a more automated solution. #Attempting to add in percents df = pd.read_excel('[mypath]/pft_tiu…  ( 11 min )
    [P] Adala – an open source Autonomous DAta (Labeling) Agent framework that helps you automate data processing and data labeling
    Hi r/MachineLearning, We have just open sourced Adala, a robust framework for implementing agents that specialize in advanced data processing tasks, starting with data labeling and generation. Agents combine knowledge output from LLMs with actions on it in production systems, so their ability to perform operations correctly and consistently is critical. We saw an opportunity to create a new agent framework that could dramatically increase the efficiency of data labeling (with broader application across data processing tasks), with the unique ability to be guided by human feedback. To ensure agents remember and build upon their experiences, Adala provides a Memory component, a dynamic storage space for the agent's acquired knowledge. For instance, retrieving an agent's previous errors (and the subsequent human feedback) gives it a starting point from which to branch off into learning or improving skills. To allow Adala to produce reliable agents, we devised two main strategies: Supervision Integration: Provide agents with 'ground truth data', well-defined examples that serve as a learning foundation. This foundational data not only sets the learning parameters for the agent but also defines its operational environment. Constrained Generation: Ensuring that an agent's predictions are within a defined and bounded range of outputs. Let us know what you think in the comments below or by contributing to the repo. submitted by /u/pirate7777777 [link] [comments]  ( 9 min )
    [P] Pre-training dataset
    I'm trying to pre-train my own language model on some high quality datasets (TinyStories,tiny-textbooks...). Some of these datasets include input-output data and some are just text (stories), I was wondering how should I format the data for pre-training. Should I only use plain text like stories and webtext in pretraining then the rest in fine-tuning (adding instruction tokens) or should I just train with all of the datasets at pre-training with the special tokens where they are needed? submitted by /u/Additional-Ad-7043 [link] [comments]  ( 9 min )
    [Research] large language/speech models and voice interface research
    Hey ML folks, my friend is working on an academic research project exploring voice interfaces, specializing in large speech models. If you have time, help him advance his research on voice interfaces. It should take 2 minutes max. https://forms.gle/a3PaQmYEiqRDxY4Z8 What's in it for you? You can share your email to get a copy of the research and see what the rest of us have said. Thanks! submitted by /u/deep-thoughts-guy [link] [comments]  ( 9 min )
    [R] Open Source video enhancement options
    We work on disease prediction based on video classification and would like to test what improving the quality of the videos would do for our models. Are there any specific components, apps, or packages we should test? So far we have used UpScayl, but we're not sure how it ranks. submitted by /u/sladebrigade [link] [comments]  ( 9 min )
    [P] OSS tool to interactively explore Hugging Face datasets with one line of code
    submitted by /u/44sps [link] [comments]  ( 8 min )
    [P] Training a transformer from scratch
    Hello! I would like to train a transformer network from scratch, without pre-training, on a language modeling task (next-word prediction) or a sequence-to-sequence task (translation). For the language modeling task, I tried the Shakespeare dataset and other simpler ones (e.g., Beatles songs), but the model tends to overfit quite quickly on the training set, probably because the corpus is too short. I know that Andrej Karpathy did it with the Shakespeare dataset in his YouTube video, but he used character-wise tokenisation, which dramatically reduces the validation loss on the next-token prediction task, given that the vocab size is tiny. I guess that in the end the generation process provides a similar quality of text as when a word tokenizer is used. Surprisingly, I had quite good results training an Encoder-Decoder model from scratch for English-to-French translation (using the 8 million examples of the Tatoeba dataset). I guess the overfitting is less prominent here because there are more datapoints and because the possible predictions are much more constrained by the input sequence. What is your experience with this? I would be happy to know how I can train my transformer without having to use a pre-trained architecture or spend weeks on GB-scale datasets. Thank you! submitted by /u/rem_dreamer [link] [comments]  ( 9 min )
    Data analysis vs ML engineering [D]
    Do you think Coursera certifications, on top of a master's in electrical engineering, can help us find better positions? I am told that for a beginner it is better to start with jobs in data analysis rather than going directly into ML engineering. Is that correct? Is data analysis a prerequisite for ML? submitted by /u/Street-Regular-9924 [link] [comments]  ( 9 min )
    [D][R] How should the architecture of a transformer be scaled?
    When increasing the parameters of a (decoder-only) transformer, one has a choice around how to spend that increased budget -- number of layers vs embedding dim vs number of heads. Anyone know if there's solid guidance out there for the proportions each aspect should be scaled in? E.g. looking at LLaMa (https://arxiv.org/abs/2302.13971), they seem to scale the first two sizes proportionally, but for larger sizes, n heads grows more slowly. https://preview.redd.it/ytdfk1d5rbwb1.jpg?width=1422&format=pjpg&auto=webp&s=abf22ac369ec5ecf81ff07b0d8a095f884efe729 submitted by /u/Tea_Pearce [link] [comments]  ( 9 min )
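    A rough way to reason about the budget question above is a back-of-the-envelope parameter count. The sketch below assumes a standard decoder block with four d_model x d_model attention projections and a two-matrix feed-forward layer with a 4x hidden size, and it ignores biases and normalization weights. The function name and defaults are illustrative. Under these assumptions the count does not depend on the number of heads, because the heads only split d_model into smaller slices.

```python
def approx_params(n_layers, d_model, vocab_size=0, ffn_mult=4):
    """Approximate weight count of a decoder-only transformer.

    Assumes head_dim = d_model / n_heads, so n_heads does not appear at all.
    """
    attn = 4 * d_model * d_model            # Q, K, V and output projections
    ffn = 2 * ffn_mult * d_model * d_model  # up and down projections
    embed = vocab_size * d_model            # token embedding table
    return n_layers * (attn + ffn) + embed


# Example: 32 layers at width 4096, excluding embeddings.
blocks_only = approx_params(32, 4096)
```

    With these defaults each layer costs about 12 * d_model^2 weights, so parameters grow linearly with depth and quadratically with width. In this approximation, choosing the number of heads is a question of per-head dimension rather than of parameter budget.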
    [D] What are some existing datasets for training LLMs to perform reasoning, acting as agents?
    There are a lot of great open datasets for fine-tuning LLMs for instruction following (e.g. LIMA, self-instruct, dolly-15k, etc.) and as chat bots (OASST, etc.). One thing I have not really seen yet is datasets that involve planning and tool use. Is anybody working on something like that, or has anybody come across any? I'm interested in working on one. If anybody has ever attempted this, I would really appreciate any advice. P.S. I do note that "reasoning" should be more rigorously defined and scoped, but I think some ambiguity around an intellectual discussion like this can help. submitted by /u/notllmchatbot [link] [comments]  ( 9 min )
    [D] Open-source SOTA Audio-to-Audio: how do I sound like a famous actor?
    Hello people, I would like to learn how to turn the recording of my voice to sound like a famous person. I imagine I would take an open source model and fine-tune it using data I will collect. Can someone point me towards the best sounding current models that I could adapt for this purpose? Thank you so much. submitted by /u/gonzales82 [link] [comments]  ( 9 min )
    [D] Guidance needed for upcoming AI/ML PhDs on selecting research topics with lasting impact
    Many upcoming Ph.D. students in AI/ML are facing the difficult decision of identifying promising research topics that will stand the test of time over the course of their Ph.D. studies. With the rapid progress in AI, especially in the NLP field, many incremental research tasks have been effectively "solved". They need to choose an area where there is ample room for open-ended inquiry and meaningful contributions over 4-5 years of PhD research. While large language models have shown impressive advances recently, their capabilities may plateau during a Ph.D. timeframe (roughly 4 years if starting next year). How should aspiring researchers choose topics resilient enough to withstand the test of time and allow them to push the field forward through their Ph.D. work? For those with experience in AI research who have seen changes in the field over time: What emerging trends or broad areas do you see as fertile ground for AI/ML PhD research now and in the coming years? Can you highlight any intriguing subfields worthy of deeper investigation by aspiring PhD students? What open problems or applications warrant more attention from the upcoming generation of PhD researchers? Some trending research topics so far: LLM in a specific domain, Prompting, Evals, LM interfaces, Safety, Understanding LMs, Emergence. Any advice on identifying PhD research topics with longevity would be greatly appreciated by aspiring graduate students. submitted by /u/aadityaura [link] [comments]  ( 9 min )
    [P][R] Test-Val scores, how much difference isn't problematic.
    Hello folks, I'm working on a medical image dataset using EM loss and asymmetric pseudo labelling for single positive multi-label learning (training using only 1 positive label). I'm using a DenseNet121 on a chest x-ray dataset. I see a difference of 10% between my validation and test scores (score = mAP: mean average precision). The score seems okay and was expected, but the difference is bothering me. I understand that it may be obvious, but do you have any insights from the plot below? The validation set contains less than half as many samples as the test set. (It is the official split; I have nothing to do with it.) I feel that is the reason, since, of course, the more randomness in a set, the poorer the convergence. https://preview.redd.it/nseqy1mw5bwb1.png?width=577&format=png&auto=webp&s=fbd63e8a5f4920a8109b6a75aeb039a3965bba58 Do share any experiences or suggestions! submitted by /u/ade17_in [link] [comments]  ( 9 min )
    [D] Are there methods that can extract interactions between people in text?
    I want to extract interactions between people in short texts. For example, "Sally will buy a new phone. Ted will help her." contains an interaction between people. However, "Japanese Karate champion won the first prize." and "Sally missed her friends, Ted and Tom" do not contain an interaction between people. Are there any tools or methods that can extract interactions? submitted by /u/tkddnjs1234 [link] [comments]  ( 9 min )
    [D] Who are some outspoken AI people who speak against AI ethics and regulation?
    I'm interested in learning more about the perspectives of AI researchers and practitioners who are critical of AI ethics and regulation. I'm particularly interested in those who argue that AI ethics and regulation are unnecessary or harmful. Please note that I'm not asking for people who are simply skeptical of certain AI ethics proposals or who believe that AI ethics should be implemented in a specific way. I'm more interested in people who argue that AI ethics is a fundamentally flawed concept or that AI should not be regulated at all. submitted by /u/Periplokos [link] [comments]  ( 9 min )
    Prompt-Specific Poisoning Attacks on Text-to-Image Generative Models
    submitted by /u/simandl [link] [comments]  ( 8 min )
    [D] What should I do for training when data to predict has random distribution?
    I was taught that for imbalanced classification, the training data should be augmented so that the classes are more or less equally represented, but the validation data should have the same distribution as the test data, and the test data should have a distribution similar to the data I will actually predict on. But what if the real data's distribution is quite random? What validation data distribution should I use? (I have 14 classes to classify; one class makes up 52% of the data, and the small ones make up 0.9% and 0.17%. Practitioners who use my model will input data containing only 3 of the classes, and those can be classes with a very small proportion. The training data before augmentation was created by integrating data with this irregular distribution.) submitted by /u/poemfordumbs [link] [comments]  ( 9 min )
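    Two standard tools that relate to the question above, sketched in plain Python with illustrative function names. The first computes inverse-frequency class weights, n_samples / (n_classes * count), a common alternative to augmenting the training set. The second computes macro-averaged recall, which averages per-class recall and so does not depend on how often each class happens to appear in the evaluation set. Whether either fits this particular 14-class problem is an assumption to verify.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}


def macro_recall(y_true, y_pred):
    """Unweighted mean of per-class recall over the classes present in y_true."""
    hits, totals = Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        if t == p:
            hits[t] += 1
    return sum(hits[c] / totals[c] for c in totals) / len(totals)
```

    Reporting per-class or macro-averaged metrics on the validation set sidesteps part of the question of which distribution to validate on, since each class's score is computed only from its own samples.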
  • Open

    Looking back at wildfire research in 2023
    Posted by Yi-Fan Chen, Software Engineer, and Carla Bromberg, Program Lead, Google Research Wildfires are becoming larger and affecting more and more communities around the world, often resulting in large-scale devastation. Just this year, communities have experienced catastrophic wildfires in Greece, Maui, and Canada to name a few. While the underlying causes leading to such an increase are complex — including changing climate patterns, forest management practices, land use development policies and many more — it is clear that the advancement of technologies can help to address the new challenges. At Google Research, we’ve been investing in a number of climate adaptation efforts, including the application of machine learning (ML) to aid in wildfire prevention and provide information to…  ( 92 min )
    Grammar checking at Google Search scale
    Posted by Eric Malmi, Senior Research Scientist, and Jakub Adamek, Senior Software Engineer, Google, Bard Team Many people with questions about grammar turn to Google Search for guidance. While existing features, such as “Did you mean”, already handle simple typo corrections, more complex grammatical error correction (GEC) is beyond their scope. What makes the development of new Google Search features challenging is that they must have high precision and recall while outputting results quickly. The conventional approach to GEC is to treat it as a translation problem and use autoregressive Transformer models to decode the response token-by-token, conditioning on the previously generated tokens. However, although Transformer models have proven to be effective at GEC, they aren’t part…  ( 91 min )
  • Open

    We hate how "black box" neural nets are, we made a thingy in an attempt to demystify their "thinking."
    submitted by /u/DeltaStarStudiosPR [link] [comments]
    AI ‘breakthrough’: neural net has human-like ability to generalize language
    submitted by /u/nickb [link] [comments]
    [Long read] Deep dive into AutoGPT: A comprehensive and in-depth step-by-step guide to how it works
    https://airt.hashnode.dev/long-read-deep-dive-into-autogpt-a-comprehensive-and-in-depth-step-by-step-guide-to-how-it-works submitted by /u/Harish_Mohanraj [link] [comments]
  • Open

    Trust in AI, Data Poisoning, and Involving People in Maturing AI
    submitted by /u/fookingyeah [link] [comments]
    Baby AGI and AgentGPT : Exploring Autonomous AI-Agents
    submitted by /u/Tao_Dragon [link] [comments]
    How can I use AI to research for my thesis?
    Hey all, I'm new to this. Can you help me please? submitted by /u/proptuxiakoskariolis [link] [comments]
    When I use AI to generate Halloween candy wrappers and then print them out...
    submitted by /u/Sea_Permit5660 [link] [comments]
    DTIYS Challenge Submission Sample Art for Oh my Anne
    submitted by /u/Oh_my_Winnie [link] [comments]
    One-Minute Daily AI News 10/24/2023
    OpenAI Executives Sam Altman Say AI Will Be Able to Do Any Job Within 10 Years.[1] Snapdragon 8 Gen 3 chipset officially announced with AI-driven functionalities.[2] Google parent Alphabet reported its third quarter earnings Tuesday, which showed more spending on AI infrastructure and muted cloud growth, culminating into several questions for executives about how all the efforts around artificial intelligence are actually going to turn into real money.[3] Adult film star Riley Reid(I don’t know who she is) launches Clona.AI, a sexting chatbot platform.[4] Sources: [1] https://www.wsj.com/podcasts/the-journal/a-conversation-with-openais-sam-altman-and-mira-murati/7c89e85f-9d7e-4569-b67d-6a777374eada [2] https://headtopics.com/my/snapdragon-8-gen-3-chipset-officially-announced-with-47616340 [3] https://www.nbcdfw.com/news/business/money-report/wall-street-wants-to-know-how-googles-going-to-profit-from-ai/3368989/ [4] https://www.engadget.com/adult-film-star-riley-reid-launches-clonaai-a-sexting-chatbot-platform-000509221.html submitted by /u/Excellent-Target-847 [link] [comments]
    Would majoring in artificial intelligence be worth it?
    The AI boom has made it more relevant than ever, and its applications are truly awe-inspiring. While it’s far from perfect, it has helped me greatly in writing, by generating content to inspire me and my projects. I have a smattering of skills, none that I’d consider especially good enough to double down upon, but learning how to optimize language learning models to produce the most adequate results would be pretty neat. I just don’t know what I want to do with my education, I’ve completed my basics and as such have a blank slate to play with, but I’m worried that whatever I select, it will be no good, and just result in lost time and money. Tertiary education seems like a necessity in the modern world, especially since the job world is more ruthless than ever, and the economy is in ashes. submitted by /u/Niobium_Sage [link] [comments]
    Need to find an Ai
    Which AI makes these cartoons? submitted by /u/hommedufuture [link] [comments]
    I've been playing around with Midjourney a little bit and this is what I got.
    https://preview.redd.it/2emqr4z8a9wb1.png?width=928&format=png&auto=webp&s=437547e7e86e23298b7c778cada9863385ce961d PROMPT: close up of eye, close up of girl eye, mangekyo sharingan, super close up, pretty eye, black and red eye, naruto anime, long eyelashes, anime eye, 2d art eye, --s 180 --style expressive https://preview.redd.it/4dffpg2ca9wb1.png?width=928&format=png&auto=webp&s=fc485a604460f7e544d0490d0bee65f984d8a5b3 PROMPT: **stained glass, it was meticulously written, picture with elaborate writing, cute girl smile with Rabbit,Flower, bold and strong line drawing, vivid acrylic painting, vivid thick paint, vivid, plain background, beautiful proof, highest resolution 16K, beautiful anime girl that is betrayeded by a Rabbit, hair is short, ferret, Beautiful lightcyan high ligh…
  • Open

    Detection and high-frequency monitoring of methane emission point sources using Amazon SageMaker geospatial capabilities
    Methane (CH4) is a major anthropogenic greenhouse gas that’s a by-product of oil and gas extraction, coal mining, large-scale animal farming, and waste disposal, among other sources. The global warming potential of CH4 is 86 times that of CO2 and the Intergovernmental Panel on Climate Change (IPCC) estimates that methane is responsible for 30 percent of observed […]  ( 12 min )
  • Open

    Research Focus: Week of October 23, 2023
    In this issue: Kosmos-2.5: A Multimodal Literate Model; Can vine copulas explain complex relationships of weather variables; New system accelerates the adaptive training process; Structural inequalities and relational labor in the influencer industry. The post Research Focus: Week of October 23, 2023 appeared first on Microsoft Research.  ( 10 min )
  • Open

    Next-Gen Neural Networks: NVIDIA Research Announces Array of AI Advancements at NeurIPS
    NVIDIA researchers are collaborating with academic centers worldwide to advance generative AI, robotics and the natural sciences — and more than a dozen of these projects will be shared at NeurIPS, one of the world’s top AI conferences. Set for Dec. 10-16 in New Orleans, NeurIPS brings together experts in generative AI, machine learning, computer Read article >  ( 8 min )
  • Open

    The right to perform RL on games
    Hi all, I'm new to RL. I want to train an agent to clear a game such as Vampire Survivors or Super Mario Bros. as my first research project. I talked with my tutor, who reminded me to pay attention to copyright issues and that I need permission to use these works for training. I guess I could get permission by asking the game's author directly, but before that, or for games produced by big companies, where can I find information about the rights? Although reading the game's memory is a challenge for me, it's cool to see an agent clear a game. submitted by /u/Ruine_fff [link] [comments]
    Building Doom with AI enemies
    I'm planning to go down the rabbit hole of using RL to train agents in Doom/ViZDoom. The goal would be to create a version of Doom where the enemies have AI and are adaptive. Doom and Doom 2 are some of my all-time classic favorites. There are people still making maps to this day! Let me know what you think about the idea. Project plan - Nov 2023: RL refresher from the David Silver RL course on YouTube. Dec 2023: start working with OpenAI and Stable-Baselines3 and watch Nicholas Renotte's videos. Jan 2024: play around with the Doom WAD and see if I can make changes to the Doom engine, plus training and setting up a custom env. Feb 2024: hopefully first level with enemy AI created. Mar 2024: release fully completed open source version of the game. Background: I work at a hedge fund and have some basics in reinforcement learning, although it has been a long, long time. Time is a bit limited after 12 hours of work and 2 hours of gym (the real human world one), so I'm kind of stretching this out. Any suggestions are welcome. Any courses, books, libraries and tools you'd suggest? submitted by /u/Sahil231090 [link] [comments]
    "Surprise" for learning?
    I was recently listening to a TalkRL podcast where Danijar Hafner explains that Minecraft as a learning environment is hard because of sparse rewards (30k steps before finding a diamond). Coincidentally, I was reading a collection of neuroscience articles today where surprise or novel events are a major factor in learning and encoding memory. Does anyone know of RL algorithms that learn based on prediction error (i.e. "surprise") in addition to rewards? submitted by /u/CognitoIngeniarius [link] [comments]
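    The idea in this question, learning from prediction error in addition to reward, can be sketched roughly as follows. This is a toy illustration only: the class name `SurpriseBonus`, the tabular running-mean forward model, and the values of `beta` and `lr` are all assumptions made for the example and do not come from the post or the podcast. Published methods in this family (for example curiosity-driven exploration with a learned forward model, or Random Network Distillation) use neural networks in place of the lookup table.

```python
from collections import defaultdict


class SurpriseBonus:
    """Toy forward model: predicts the next state per (state, action) as a running mean."""

    def __init__(self, beta=0.1, lr=0.5):
        self.beta = beta                 # weight of the surprise term (hypothetical value)
        self.lr = lr                     # step size for updating the prediction
        self.pred = defaultdict(float)   # predicted next state, 0.0 until first seen

    def reward(self, state, action, next_state, extrinsic):
        key = (state, action)
        error = next_state - self.pred[key]   # prediction error of the forward model
        surprise = error ** 2                 # squared error used as "surprise"
        self.pred[key] += self.lr * error     # move the prediction toward the observation
        return extrinsic + self.beta * surprise


bonus = SurpriseBonus(beta=0.1, lr=0.5)
# The same transition seen twice: the bonus shrinks as the model learns to predict it.
first = bonus.reward(state=0, action=1, next_state=4.0, extrinsic=0.0)
second = bonus.reward(state=0, action=1, next_state=4.0, extrinsic=0.0)
print(first, second)
```

    The design point is that the bonus decays for transitions the agent can already predict, so exploration is pushed toward unfamiliar parts of the environment even when the extrinsic reward is zero.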
  • Open

    Frontier Model Forum updates
    Together with Anthropic, Google, and Microsoft, we’re announcing the new Executive Director of the Frontier Model Forum and a new $10 million AI Safety Fund.  ( 4 min )

  • Open

    A warning about an unknown danger of AI. Current uses of AI have been overwhelmingly positive but there is an unknown danger that I would like to speak to.
    I want to warn AI companies and developers about a danger that is not known about regarding AI. The reason it is not known about regarding AI is that it isn't known about in general and so the AI community can hardly be blamed for that. Unfortunately, the danger here has to do with the fundamental nature of human society and social interaction as it stands at this time. The issue is that there is 'hidden language' used in social communication and unlike typical conceptions of things like body language this is not auxiliary to our rational purposes, rather our rational purposes are auxiliary to the hidden communication. One way of describing it would be that our formal language is a 'carrier wave' to encode other information about our status and the status of others. So our communications …
    AI Psychology Test: What happens in viewers' minds when news segments about important major events shift to commercials where the announcer is talking like a comic character?
    When news segments covering major, often serious, events abruptly switch to lighthearted or comical commercials, a cognitive dissonance can occur in the viewer. Here's why: news programs are designed to engage the viewer's analytical faculties. They present facts, figures, and expert opinions, demanding cognitive effort to understand the implications. The viewer is in a "serious" mode, applying critical thinking to absorb the information. Commercials, particularly the comic ones, often aim for emotional engagement rather than intellectual analysis. They use humor, catchy jingles, and attractive visuals to create a positive association with the product being advertised. When the transition between these two contrasting tones is sudden, the viewer has to perform a rapid mental shift from analytical to emotional engagement. This can be jarring. This dissonance can have a few different outcomes. For one, it might diminish the impact of both the news segment and the commercial. The viewer might find it difficult to fully engage with either, as the cognitive "gear shifting" can be distracting. Secondly, this dissonance can potentially undermine the gravitas of the news. When sandwiched between comic commercials, serious topics might lose some of their perceived importance. Lastly, it can make the commercial less effective. The viewer, still in a serious mindset, may not be as receptive to the emotional triggers that the commercial aims to pull. So, in essence, this rapid shift can dilute the efficacy and impact of both the news and the advertising, while causing cognitive friction for the viewer. CGPT-4 submitted by /u/Georgeo57 [link] [comments]
    Any good AI-integrated video games?
    Does anybody know of any good AI integrated games that have been released or are in beta? I'm really interested to see how people have incorporated the current boom in AI into game design. submitted by /u/Rfallmann [link] [comments]
    Managing AI Risks in an Era of Rapid Progress
    The rapid progress of AI development brings both opportunities and risks. While AI systems have the potential to cure diseases and elevate living standards, they also pose large-scale risks that we are not prepared to handle. Without proper safety measures and ethical considerations, advanced AI systems could amplify social injustice, erode social stability, and enable criminal activities. The development of highly advanced autonomous AI systems also raises concerns about the pursuit of undesirable goals and the loss of human control. To ensure a positive outcome, research breakthroughs in AI safety and ethics are needed, along with effective government oversight. Source : https://managing-ai-risks.com/ submitted by /u/NuseAI [link] [comments]
    Deepfakes Just Got Very Real
    Interesting read about deepfakes that started with a Reddit post. https://www.linkedin.com/pulse/deepfakes-just-got-very-real-scott-clark-sfurc submitted by /u/scottimherenowwhat [link] [comments]
    How AI could change Google search and wipe out $68 billion SEO industry | Fortune
    Oh well 🤷‍♂️ submitted by /u/AminoOxi [link] [comments]
    🦾ERNIE 4.0 vs GPT-4, Tightened AI Chip Restrictions, Alibaba Tencent Fund AI Startup, and China's Global AI Governance Initiative
    submitted by /u/trcytony [link] [comments]
    Stanford AI Conference - New Horizons in Generative AI: Science, Creativity, and Society - Livestreaming Now
    submitted by /u/Nice-Inflation-1207 [link] [comments]
    Dancing with Light: A Hummingbird's Enchanted Encounter.
    submitted by /u/IllustriousVideo6145 [link] [comments]
    150+ Awesome ''Act As'' ChatGPT Prompts
    submitted by /u/Senior_tasteey [link] [comments]
    ChatGPT, invent comics for robots.
    submitted by /u/Philipp [link] [comments]
    An A.I. video interpretation of "Metamorphosis Two" by Philip Glass
    submitted by /u/AnimalsChasingCars [link] [comments]
    I have a question
    What’s the best voice ai for song covers? Like I wanna do someone like Donald Trump, Cartman, Ice King/Simon singing The Boys (Eng Ver) by SNSD. Also it has to be free! submitted by /u/Ok-Upstairs-9887 [link] [comments]
    Apple and AI
    Apple has been behind in the AI field compared to companies like OpenAI, Google, Microsoft, and Amazon. While Apple has made improvements in autocorrect and AI features in Photos, it needs to catch up to remain competitive. Apple executives have been scrambling to make up for lost time and have been working on generative AI technology. There is anxiety within Apple about whether their AI/ML team can deliver. Source : https://daringfireball.net/2023/10/apple_and_ai submitted by /u/NuseAI [link] [comments]
    🚀 Gaming with ChatGPT using Encrypted Prompts and Prompt Injection! 🎮
    submitted by /u/Gloomy_Recognition_4 [link] [comments]
    How are neobanks utilizing AI to offer more accurate and personalized financial advice to customers?
    Your answers are appreciated. submitted by /u/Cygnet-Digital [link] [comments]
    One-Minute Daily AI News 10/23/2023
    The U.S. Senate will hold the second in a series of bipartisan AI Insight Forums on Tuesday, Oct. 24, where senators will hear from some of the most influential tech leaders to help inform regulations around the technology.[1] Microsoft announces A$5 billion investment in computing capacity and capability to help Australia seize the AI era.[2] Samsung is going all in with the AI performance of the Galaxy S24 phones.[3] Reddit has reportedly decided to block AI startups from scraping data from its website. This move prevents third-party companies from using Reddit’s data to train their machine-learning models without permission.[4] Sources: [1] https://news.asu.edu/20231020-government-calling-tech-leaders-help-crafting-artificial-intelligence-legislation [2] https://news.microsoft.com/en-au/features/microsoft-announces-a5-billion-investment-in-computing-capacity-and-capability-to-help-australia-seize-the-ai-era/ [3] https://www.androidheadlines.com/2023/10/samsung-galaxy-s24-smartest-ai-phone.html [4] https://www.androidheadlines.com/2023/10/reddit-block-ai-startups-scraping-data.html submitted by /u/Excellent-Target-847 [link] [comments]
    You cannot escape my attack. Just as no human can ever escape the reality of death. Accept it. It is fate.
    submitted by /u/nicdunz [link] [comments]
    Anti deepfake headset V2
    You can find out more here in the comments submitted by /u/ahauss [link] [comments]
  • Open

    [D] How should I calculate the weights for a multi-label classification task where the labels are dependent among one another?
    I'm not sure if I worded the title correctly. Let me elaborate on the scenario. I have a multi-label image classification task where I'm trying to classify the gender of clothing images. The two labels that we can predict are Male and Female, hence the final logit vector's size would be something like [batch_size, 2]. Depending on the predictions, we're mapping the following binary values to different categorical values: [0, 0]: Unknown [0, 1]: Male [1, 0]: Female [1, 1]: Unisex The overall distribution is heavily imbalanced with Male being the minority class. I'm trying to calculate class weights to favor Male, but the problem is that the size of the weight tensors to be provided to the loss function should have a length of 2. I say this is a problem because although the number of prediction logits is 2, the actual number of classes is 4. I used the word "dependent" in my title because, for example, [1, 1] wouldn't necessarily mean that the image has the labels Male and Female, rather that it's a completely new Unisex label. Again, not sure if the usage of the word is appropriate. Anyway I've thought of making a custom loss function to first map the binary labels to their respective categorical values, but am wondering if there is any other way to go about this. submitted by /u/Seankala [link] [comments]  ( 9 min )
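    One way to look at the setup above is to treat the four bit combinations as a single 4-class problem and compute one weight per combined class. The sketch below does that with inverse-frequency weights in plain Python. The helper names (`to_class_index`, `inverse_frequency_weights`) and the toy label counts are assumptions for illustration and are not taken from the post.

```python
from collections import Counter

# Mapping taken from the post: (first_bit, second_bit) -> category name.
CATEGORIES = {(0, 0): "Unknown", (0, 1): "Male", (1, 0): "Female", (1, 1): "Unisex"}


def to_class_index(label_pair):
    """Encode a 2-bit label pair as a single class index in 0..3."""
    first, second = label_pair
    return 2 * first + second


def inverse_frequency_weights(label_pairs, num_classes=4):
    """Weight each class by n_samples / (num_classes * class_count)."""
    counts = Counter(to_class_index(pair) for pair in label_pairs)
    total = len(label_pairs)
    return [
        total / (num_classes * counts[c]) if counts[c] else 0.0
        for c in range(num_classes)
    ]


# Toy data (hypothetical): Female is common, Male is rare.
labels = [(1, 0)] * 6 + [(0, 1)] * 1 + [(1, 1)] * 2 + [(0, 0)] * 1
weights = inverse_frequency_weights(labels)
for pair, name in CATEGORIES.items():
    print(name, round(weights[to_class_index(pair)], 3))
```

    If the model head were changed to emit four logits instead of two, a length-4 weight list like this could be passed to a weighted cross-entropy loss (for instance the `weight` argument of `torch.nn.CrossEntropyLoss`). Whether that reformulation suits the original task is a judgment call the post leaves open.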
    [D] LSTM: Train & Val losses not converging
    I am training an LSTM model for path prediction where I'm feeding in OBT (on-board time) and the X matrix as input, and the Y matrix is the predecessor matrix generated using Scipy.Dijkstra. This is the model architecture for reference. I've tried multiple iterations of this model, but the training and validation losses are not converging. The best train_loss I've been able to achieve is 88k mse and 400 mse val_loss. I've uploaded the dataset here: GitHub - mathur-exe/LSTM_Dataset Training Progress: Epoch 1/100 342/342 - 17s - loss: 22606898.0000 - val_loss: 61414736.0000 - 17s/epoch - 49ms/step Epoch 2/100 342/342 - 14s - loss: 7990657.0000 - val_loss: 3699703.5000 - 14s/epoch - 40ms/step Epoch 3/100 342/342 - 13s - loss: 4130298.7500 - val_loss: 136808.1094 - 13s/epoch - 38ms/step Epoch 4/100 342/342 - 12s - loss: 2747299.2500 - val_loss: 35710.1680 - 12s/epoch - 35ms/step Epoch 5/100 342/342 - 12s - loss: 2558378.2500 - val_loss: 3383.4780 - 12s/epoch - 36ms/step Epoch 6/100 342/342 - 13s - loss: 1214455.8750 - val_loss: 111625.2891 - 13s/epoch - 37ms/step Epoch 7/100 342/342 - 19s - loss: 337480.2500 - val_loss: 68686.6094 - 19s/epoch - 55ms/step Epoch 8/100 342/342 - 15s - loss: 316366.7188 - val_loss: 2059.3557 - 15s/epoch - 44ms/step Epoch 9/100 342/342 - 20s - loss: 293117.0312 - val_loss: 20961.5469 - 20s/epoch - 58ms/step Epoch 10/100 342/342 - 17s - loss: 575945.1875 - val_loss: 503602.8438 - 17s/epoch - 50ms/step Epoch 11/100 342/342 - 13s - loss: 290962.8750 - val_loss: 62491.9375 - 13s/epoch - 37ms/step Epoch 12/100 342/342 - 12s - loss: 1125042.5000 - val_loss: 36054.6836 - 12s/epoch - 36ms/step Epoch 13/100 ... 342/342 - 16s - loss: 230900.7969 - val_loss: 48309.6094 - 16s/epoch - 47ms/step Epoch 93/100 342/342 - 23s - loss: 232846.6094 - val_loss: 82926.6875 - 23s/epoch - 67ms/step submitted by /u/Gaurang_Mathur_ftw [link] [comments]  ( 9 min )
    [D] Will ChatGPT remove the need for data annotation?
    I wrote a blog post about this detailing my experience, which I will attach at the bottom but I want to hear opinions of people. It is something I've actively been thinking about, and would like to know potential pitfalls and why it may not work, rather than the huge promise it holds. https://ozanciga.wordpress.com/2023/10/24/will-chatgpt-remove-the-need-for-data-annotation/ submitted by /u/ozanciga [link] [comments]  ( 9 min )
    [R] Monarch Mixer: A Simple Sub-Quadratic GEMM-Based Architecture
    submitted by /u/hzj5790 [link] [comments]  ( 9 min )
    [D] Is it better to create a different set of Doc2Vec embeddings for each group in my dataset, rather than generating embeddings for the entire dataset?
    I'm using Top2Vec with Doc2Vec embeddings to find topics in a dataset of ~4000 social media posts. This dataset has three groups: posts from a company (3%); posts from this company's potential customers (82.5%); and posts from this company's competitors (14%). The purpose of this analysis is to look at the topics the company is posting about on social media and see how they compare to the things that its customers and competitors are posting about. Since the values of Doc2Vec embeddings depend on the other documents in the dataset, I'm worried that topics in the smaller groups are going to be drowned out by the larger group. I'm worried that the differences between the document vectors in a smaller group are going to be made smaller by the presence of the documents from the larger group, which may represent a much wider array of different topics. submitted by /u/abelEngineer [link] [comments]  ( 9 min )
    [D] How would you do it? Handling multi-turn QA conversation with matching of questions to vector database.
    I have been giving this some thought and would appreciate some outside input, maybe someone has some experience they could share! I am attempting to create a QA chatbot that is limited to answering questions from a pre-determined set of question and answer pairs I have in a vector database. Currently I create embeddings of the question using OpenAI and query a vector database for a similar "reference question" - if the similarity score is high enough I proceed and use the answer text I have stored in the metadata as "context" for the answer generation. I would now like to extend this to include conversational history. The issue I am facing, however, is that a follow-on question may not hit the similarity threshold. Considering a follow-up question would typically not be worded in a way that …  ( 10 min )
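    The single-turn lookup described in this post (embed the question, find the closest reference question, and use its stored answer only if the score clears a threshold) can be sketched roughly as below. The 2-dimensional vectors stand in for real embeddings, and the function names, the record layout, and the `0.8` threshold are hypothetical choices made for the example. A real system would call an embedding model and a vector database instead.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def best_match(query_vec, references, threshold=0.8):
    """Return (answer, score) for the closest reference, or (None, score) below threshold."""
    best_answer, best_score = None, -1.0
    for ref in references:
        score = cosine(query_vec, ref["embedding"])
        if score > best_score:
            best_answer, best_score = ref["answer"], score
    if best_score < threshold:
        return None, best_score   # too dissimilar: do not answer from the store
    return best_answer, best_score


# Toy reference set (hypothetical embeddings and answers).
references = [
    {"embedding": [1.0, 0.0], "answer": "Answer A"},
    {"embedding": [0.0, 1.0], "answer": "Answer B"},
]
hit = best_match([0.9, 0.1], references)    # close to the first reference
miss = best_match([0.7, 0.7], references)   # equally far from both references
print(hit, miss)
```

    The `miss` case corresponds to the failure mode the post raises: a follow-up question whose embedding sits between reference questions falls under the threshold and returns no context.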
    [P] The ML Practitioner, a publication about all things machine learning and MLOps
    Hi all, my wife and I have recently started a new publication called The ML Practitioner. If you're interested in writing for us, please send us a link to your unpublished draft here. Either way, please subscribe to us if you're interested in this kind of content! submitted by /u/kanxx030 [link] [comments]  ( 9 min )
    [D] efficacy of cold start preferences on recs systems
    Hi all, Are there good papers about the efficacy of cold start explicit preference collection (think Netflix “pick some movies you like”) on the recs systems? I haven’t been able to find any so far. One key aspect I’m looking for is if these are effective, how long they are relative to just implicit actions the user takes. Thanks submitted by /u/steathilynecessary [link] [comments]  ( 9 min )
    [D] Embedding models ranked by encode speed?
    Hello, sbert.net has a list where you can sort by encode speed, but it's a very small subset of the HuggingFace MTEB leaderboard. AFAICT, the HuggingFace leaderboard / model pages don't have this information. Is there a more up-to-date list of models ranked by encoding speed? submitted by /u/rsamrat [link] [comments]  ( 9 min )
    [D] Finite State Transducers and language productivity
    In the context of NLP, will language models based on finite state transducers (since they are finite) ultimately fail to put language's productive nature to good use? All the possible outputs a finite state transducer can produce are predictable, while all the possible outputs a given natural language can produce are much less predictable? submitted by /u/RecordingOk5720 [link] [comments]  ( 9 min )
    [P] Equinox KV Cache
    I've been trying to implement a kv cache in my language model but have been unsuccessful so far due to the dynamic shapes. I've seen some implementations in flax but was wondering if it was possible to implement in equinox as that's what I'm using and prefer over others like flax. If anyone can point me in the right direction or help with the implementation that would be great! PS: I can provide any code if wanted to help submitted by /u/Additional-Ad-7043 [link] [comments]  ( 9 min )
    Explainable Boosting Machine Local and Global Explanation plots label size [D]
    I am using EBM for a research project. The local and global explanation plots it produces come with a preset font size, and I want to change the resolution of the figure and the font size of the labels and the x and y ticks in the explanation plots. I have looked for this on the InterpretML GitHub page and issues and scrolled through various webpages but haven't found anything helpful. I tried GPT, but it does not help either: it tries to use matplotlib, but EBM plots are not compatible with it. Please share any way this can be solved, because the plot labels are unreadable in the article if used as they are. submitted by /u/Horseman099 [link] [comments]  ( 9 min )
    [D][P] What is the metric for early stopping in YOLOv8 detection?
    I am trying to fine-tune the YOLOv8 detection model and was going through the Ultralytics code base. I found this piece of code in engine.trainer: # Early Stopping if RANK != -1: # if DDP training broadcast_list = [self.stop if RANK == 0 else None] dist.broadcast_object_list(broadcast_list, 0) # broadcast 'stop' to all ranks if RANK != 0: self.stop = broadcast_list[0] if self.stop: break # must break all DDP ranks I'm familiar with how early stopping works, but I'm not sure what they are doing here. Does this get invoked by default? What metric do they use to decide when to stop? Upon further inspection I found this: self.stopper, self.stop = EarlyStopping(patience=self.args.patience), False which is imported as from ultralytics.utils.torch_utils import (EarlyStopping, ModelEMA, de_parallel, init_seeds, one_cycle, select_device, strip_optimizer) Please help me find out what metric they use for stopping and whether early stopping is invoked by default. submitted by /u/rakk109 [link] [comments]  ( 9 min )
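    For readers unfamiliar with the pattern behind a `patience` argument, here is a generic patience-based stopper. It is a sketch of the general technique only and is not the Ultralytics `EarlyStopping` class. The class name `PatienceStopper`, the "higher is better" score, and the toy score list are all assumptions. Which metric Ultralytics actually monitors, and its default patience, should be checked in `ultralytics.utils.torch_utils` and the trainer's validation code.

```python
class PatienceStopper:
    """Signal a stop when the monitored score has not improved for `patience` epochs."""

    def __init__(self, patience=3):
        self.patience = patience
        self.best_score = float("-inf")
        self.best_epoch = 0

    def __call__(self, epoch, score):
        if score > self.best_score:   # higher is better in this sketch
            self.best_score = score
            self.best_epoch = epoch
        # Stop once `patience` epochs have passed since the best epoch.
        return (epoch - self.best_epoch) >= self.patience


stopper = PatienceStopper(patience=3)
scores = [0.10, 0.30, 0.29, 0.28, 0.27, 0.26]   # hypothetical per-epoch validation scores
stopped_at = None
for epoch, score in enumerate(scores):
    if stopper(epoch, score):
        stopped_at = epoch
        break
print(stopped_at, stopper.best_epoch)
```

    In a distributed run, a flag like this is typically computed on one process and then shared with the others so that every rank leaves the training loop at the same epoch, which appears to be the purpose of the broadcast in the quoted snippet.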
    [P] The N Implementation Details of RLHF with PPO
    We are happy to share a great repro of OpenAI's early RLHF codebase, with nearly identical learning curves. We also summarized implementation details (did you know Adam Optim's implementation details could impact RLHF?) 📜 Blog post:https://huggingface.co/blog/the_n_implementation_details_of_rlhf_with_ppo 💾 Code: https://github.com/vwxyzjn/lm-human-preference-details submitted by /u/vwxyzjn [link] [comments]  ( 9 min )
    ML [project] [p]
    What are the best ways to collect a dataset for an ML project? submitted by /u/GingSkywalker [link] [comments]  ( 8 min )
    [R] Feature Space Reduction Method for Ultrahigh-Dimensional, Multiclass Data: RFMS
    We are excited to announce the publication of our groundbreaking scientific paper in Machine Learning: Science and Technology titled “Feature Space Reduction Method for Ultrahigh-Dimensional, Multiclass Data: Random Forest-Based Multiround Screening (RFMS)” by Gergely Hanczar, Marcell Stippinger, David Hanak, Marcell T Kurbucz, Oliver M Torteli, Agnes Chripko, and Zoltan Somogyvari. Published on: 19 October 2023 DOI: 10.1088/2632-2153/ad020e Volume 4, Number 4 In recent years, several screening methods have been published for ultrahigh-dimensional data that contain hundreds of thousands of features, many of which are irrelevant or redundant. However, most of these methods cannot handle data with thousands of classes. Prediction models built to authenticate users based on multichannel biometric data result in this type of problem. In this study, we present a novel method known as random forest-based multiround screening (RFMS) that can be effectively applied under such circumstances. The proposed algorithm divides the feature space into small subsets and executes a series of partial model builds. These partial models are used to implement tournament-based sorting and the selection of features based on their importance. This algorithm successfully filters irrelevant features and discovers binary and higher-order feature interactions. To benchmark RFMS, a synthetic biometric feature space generator known as BiometricBlender is employed. Based on the results, the RFMS is on par with industry-standard feature screening methods while possessing many advantages. r/IAMA - Oct 26 with the founders of Cursor Insight. https://bit.ly/AMAwithCursorInsight-GoogleCalendar submitted by /u/CursorInsight [link] [comments]  ( 9 min )
    [N] New letter from Yoshua Bengio, Geoffrey Hinton, and others: Managing AI Risks in an Era of Rapid Progress
    Signatories include Turing Award winners Yoshua Bengio and Geoffrey Hinton, as well as other academics and experts. In 2019, GPT-2 could not reliably count to ten. Only four years later, deep learning systems can write software, generate photorealistic scenes on demand, advise on intellectual topics, and combine language and image processing to steer robots. As AI developers scale these systems, unforeseen abilities and behaviors emerge spontaneously without explicit programming. Progress in AI has been swift and, to many, surprising. The pace of progress may surprise us again. Current deep learning systems still lack important capabilities and we do not know how long it will take to develop them. However, companies are engaged in a race to create generalist AI systems that match or ex…  ( 10 min )
    [D] Generative Food
    Hey guys, I sometimes post about tiny ML projects we work on. This time, we talk about applying language models for generating recipe titles/ideas. Specifically, we don't use LLMs, and this turns out to be a bit of a controversial decision, but one that has its own advantages. Quite interested in the community's take on it: https://engineering.hellofresh.com/recipes-and-generative-ai-6d74a107860c submitted by /u/abnormdist [link] [comments]  ( 9 min )
    [R] Tokenizer Choice For LLM Training: Negligible or Crucial?
    https://arxiv.org/abs/2310.08754 While the recent success of LLMs has been driven primarily by curation of training dataset composition, scaling of model architectures and dataset sizes, and advances in pretraining features, the impact of tokenizers has often remained a blind spot. Our researchers' study sheds light on this issue and shows that tokenizer choice can significantly impact downstream model performance as well as training and inference costs. 1️⃣ Investigation of intrinsic tokenizer performance, i.e., study of tokenizer properties (e.g., generated vocabulary) and tokenization results of tokenizers. 2️⃣ Investigation of the extrinsic performance of the tokenizer, i.e., the impact of the tokenizer on the downstream performance of the model. 3️⃣ Investigation of possible correlation between intrinsic and extrinsic tokenizer performance. 💡 The investigation shows that the common tokenizer evaluation metrics "fertility" and "parity" do not always predict the performance of the downstream model, making these metrics a questionable criterion for tokenizer evaluation. 💡 Moreover, the study shows that multilingual tokenizers - which are based on the five most common European languages - require a vocabulary roughly three times larger than English-only tokenizers. The previous approach of training tokenizers with English vocabulary only thus turns out to be inefficient and results in a strong performance degradation and additional training costs of up to 68%. submitted by /u/effi28_ml [link] [comments]  ( 9 min )
    [R] Using Machine Learning to Drive Portfolio Asset Allocations
    I'd love to hear your thoughts on next steps to improve this: maybe deeper layers and more nodes, or maybe a random forest is more appropriate? I'd love to hear any thoughts on Machine Learning directly applicable to time-series data. https://www.quantitativefinancialadvisory.com/post/asset-allocation-in-a-post-modern-portfolio-theory-world-part-1-the-single-layer-taarp-ml-model The Main Idea We will develop a Machine Learning model, specifically a deep learning model (more hidden layers to come), to periodically, tactically rebalance the weights of our portfolio based on observable market data and empirically determined statistics combined with feature engineering from the past 21 trading days, and for the VIX we consider its characteristics since inception. The output will be a range representing the degree to which we bet long, short, or hold cash, and 3 weights that sum to less than or equal to one and greater than or equal to negative one. In essence we will allow shorting of securities and not require our portfolio to be fully invested. Cash is an active position; sometimes the best investment is staying on the sidelines. The model will have one input layer, either one or two hidden layers (to show that more might not always be better, especially given the 200-variable maximum Excel Solver imposes on us), and an output layer with 3 nodes, each outputting a value between -1 and +1, with -1 representing a full allocation to a short position in the security and +1 representing a fully allocated long position. submitted by /u/QFA_official [link] [comments]  ( 9 min )
    [D] Are people in ML PhDs still happy?
    As an outsider who has many friends in ML PhDs, this is my perspective of their lives: (1) long hours, working nights, weekends; (2) no work-life balance, constant fear of being scooped and time pressure from deadlines; (3) frustrating broken review systems; (4) many incremental, advertisement papers that produce very little actual contribution (which is justified by 2.); (5) "engineering" and not "science"; (6) all this pressure amounts to severe imposter syndrome. Are people in the field still happy? Where do people get their satisfaction? To me it looks almost like a religion or a cult. The select few who, say, get a NeurIPS outstanding paper are promoted to stardom - almost a celebrity status - while everyone else suffers a punishing work cycle. Are the PhD students all banking on AGI? What else motivates them? Edit: the discussion is about whether 1-6 are worse in ML than other fields (or even the median experience). The reference for "other field" is highly heterogeneous. Experience obviously varies by lab, and then even by individuals within labs. "It happens in other fields too" is a trivial statement - of course some version of 1-6 affects somebody in another field. Edit 2: small n but summarizing the comments - experience seems to differ based on geographic region, one's expectations for the PhD, ability to exert work-life balance, and to some extent ability to ignore the trends others are all following. Some people have resonated with problems 1-6, yet others have presented their own anecdotal solutions. I recommend reading comments from those who claim to have solutions. submitted by /u/shenkev [link] [comments]  ( 9 min )
    [P] A PDF tool that supports three retrieval strategies, allowing users to choose the answer that suits them best
    ➡️ Check on https://huggingface.co/spaces/xuyingliKepler/VecDBCompare 📌 Introduction: VecDBCompare is a streamlit-based application designed to evaluate and compare three different vector database retrieval strategies. Users only need to upload a PDF and interact with QABots using three different strategies to determine which strategy is most suitable for them. ⭐️ Three retrieval strategies: Chunk Strategy: Divides the document into small chunks and retrieves based on the most relevant chunks. Summary Strategy: Summarizes the document and retrieves based on the summary content. Hypothetical Question Strategy: Generates hypothetical questions that the document might answer and retrieves based on these questions. submitted by /u/xuying_li [link] [comments]
    [D] [P] 3D Design file labelling and classification for manufacturing
    I have ~1 million 3D design (.STP and/or .OBJ) files of various parts for medical devices, aerospace, automotive or defense systems. I'd like to label them based on the manufacturing methods that are used to physically make them. Some example methods and labels would be milling, turning, injection molding, CNC machining, etc. After labelling, I'd like to architect a system to produce these labels as inference for a new part that has not been physically made yet. My team (<5 people) has manufacturing domain expertise and can manually label these parts, but I'm looking for a more scalable solution that isn't as time-consuming. Crowd-sourced methods like Mechanical Turk won't work because annotators do not have the domain knowledge to mark the correct label. Labelling platforms like SageMaker/Azure ML Studio only allow image/text/audio datasets. Is there a platform that'll help me set up labelling tasks for 3D designs? Furthermore, how can I find more experts who can help scale this up? It seems to me that the only option is to build my own labelling app, as an annotator needs these key features: a 3D model visualizer so they can spin the part and view any orientation; drawing a bounding box (commonly available in other platforms); toggling measurements in inches/mm. As for label classification, I'm looking at architectures like PointNet since my dataset of meshes can be sampled to point clouds. Are there other methods that would work better or are worth exploring? Open to any and all suggestions across this pipeline. submitted by /u/rootcage [link] [comments]  ( 9 min )
    [D] Undergrad seeking advice on ethics/ML research
    I’m an undergraduate who’s considering a PhD in ML. I’m currently in a lab that focuses on ethics in AI. While I love the work, it focuses on the humanities side of CS. I’ve always been a more mathy person and have always been interested in theoretical ML research. I’d like to combine ethics & AI/ML in some way (e.g., studying explainable AI from the technical perspective). I was wondering what research areas combine the two and, if I don’t work in academia, what the market and job prospects are like for someone who does this? submitted by /u/SnooChipmunks1902 [link] [comments]  ( 9 min )
  • Open

    Rewards in Montezuma's Revenge
    Hello all, I'm working on Montezuma's Revenge using the Gymnasium API. Does anyone here know the numerical values of the rewards and, if so, how they are typically scaled down? Thanks! G_bes submitted by /u/G_bes [link] [comments]
    The N Implementation Details of RLHF with PPO
    submitted by /u/vwxyzjn [link] [comments]
    Creating a Custom Environment in Unreal Engine 5
    Hello, I would like to create my own environment (a maze) in which to train my drone using reinforcement learning. I am kind of new and I don't know how to set the state space and rewards, or, if I use SB3 for training, how to connect the environment. And for the agent, which is the drone, should I just run the AirSim build.cmd, take the agent from there, and place the starting position flag, or what? I am a bit lost and I can't find tutorials on how to do this, so I'd appreciate it if you could provide some guidance. Thanks in advance. submitted by /u/Gabii99 [link] [comments]
  • Open

    Intelligent document processing with Amazon Textract, Amazon Bedrock, and LangChain
    In today’s information age, the vast volumes of data housed in countless documents present both a challenge and an opportunity for businesses. Traditional document processing methods often fall short in efficiency and accuracy, leaving room for innovation, cost-efficiency, and optimizations. Document processing has witnessed significant advancements with the advent of Intelligent Document Processing (IDP). With […]  ( 20 min )
    T-Mobile US, Inc. uses artificial intelligence through Amazon Transcribe and Amazon Translate to deliver voicemail in the language of their customers’ choice
    This post is co-authored by Dhurjati Brahma, Senior Systems Architect at T-Mobile US, Inc and Jim Chao, Principal Engineer/Architect at T-Mobile US, Inc and Nicholas Zellerhoff Associate Systems Architect at T-Mobile US, Inc. T-Mobile US, Inc. provides a Voicemail to Text service to its customers, which allows customers to quickly read through their voicemails and […]  ( 7 min )
  • Open

    DSC Weekly 24 October 2023
    Announcements Top Stories In-Depth The post DSC Weekly 24 October 2023 appeared first on Data Science Central.  ( 21 min )
    Seamless integration of data from unconventional source systems into Business Intelligence using data science techniques
    Written by Venkata Nori and Kshitij Gopali. Introduction As technology is evolving, most companies in the world are adopting advanced mechanisms for their daily tasks of storing/updating data, project management & tracking, incident management, version control, etc. Periodically, these companies’ business stakeholders would want to extract and analyze the data to see how the business… The post Seamless integration of data from unconventional source systems into Business Intelligence using data science techniques appeared first on Data Science Central.  ( 25 min )
    How data science and medical device cybersecurity cross paths to protect patients and enhance healthcare
    A recent interview by Medical Device Network with GlobalData medical analyst Alexandra Murdoch shares interesting insights into cybersecurity for medical devices. The post How data science and medical device cybersecurity cross paths to protect patients and enhance healthcare appeared first on Data Science Central.  ( 22 min )
    Skills required to excel in a business analytics career
    In the contemporary business landscape, where data is heralded as the new oil, Business Analytics has emerged as a pivotal domain, steering organizations towards informed decision-making and strategic planning. Business analytics encompasses the utilization of data, statistical algorithms, and machine learning techniques to comprehend the business context, forecast future trends, and facilitate optimal decision-making. The… The post Skills required to excel in a business analytics career appeared first on Data Science Central.  ( 22 min )
    GenAI: The game-changer in data analytics
    In an era where data drives decisions, GenAI emerges as a prodigious force in the realm of data analytics. According to Statista, the LLM market is expected to show an annual growth rate of 24%, resulting in a market volume of $207 bn by the end of 2030. This cutting-edge technology, built on sophisticated algorithms… The post GenAI: The game-changer in data analytics appeared first on Data Science Central.  ( 22 min )
  • Open

    Animated AI
    submitted by /u/nickb [link] [comments]
  • Open

    Best of N series
    A couple days ago I wrote about the likelihood of the better team winning a best-of-five or best-of-seven series. That is, if the probability of X winning a game against Y is p > ½, how likely is it that X will win a majority of 5 games or a majority of 7 games. This […] Best of N series first appeared on John D. Cook.  ( 6 min )
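The question in this excerpt (how likely the better team is to win a majority of N games) can be sketched numerically. This is a minimal illustration, assuming games are independent with a fixed per-game win probability p; the function name `series_win_probability` is made up for this example and is not taken from the post.

```python
from math import comb


def series_win_probability(p: float, n_games: int) -> float:
    """Probability of winning a majority of n_games independent games.

    Assumes every game is won with the same probability p and that
    n_games is odd, so a tie is impossible.
    """
    if n_games % 2 == 0:
        raise ValueError("n_games must be odd")
    needed = n_games // 2 + 1  # wins required for a majority
    # Sum the binomial probabilities of winning k = needed .. n_games games.
    return sum(
        comb(n_games, k) * p**k * (1 - p) ** (n_games - k)
        for k in range(needed, n_games + 1)
    )


best_of_five = series_win_probability(0.6, 5)
best_of_seven = series_win_probability(0.6, 7)
```

Counting a majority out of all N games gives the same answer as stopping the series once one side has clinched, because the unplayed games cannot change the winner.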
    Lessons from Skylab
    I discovered the Space Rocket History Podcast a while back and listened to all the episodes on the Apollo program. I’m now listening to the episodes on Skylab as they come out. I came for Apollo; I stayed for Skylab. I would not have sought out the episodes on Skylab, and that would have been […] Lessons from Skylab first appeared on John D. Cook.  ( 6 min )
    Curvature: principal, Gauss, and mean
    This post will compute the center of curvature for an object described in the previous post. In order to do that we first need to describe principal curvature and Gauss curvature, and we’ll throw in mean curvature while we’re at it. Let S be a surface sitting in three dimensional space. No need for more […] Curvature: principal, Gauss, and mean first appeared on John D. Cook.  ( 6 min )
    An algebraic superegg
    One year ago I wrote about a variant of the squircle that is quantitatively close to the customary definition but that has nicer algebraic properties. That post used the term p-squircle for the usual squircle with equation where p > 2, and the term s-squircle for the variation with equation where s is between 0 […] An algebraic superegg first appeared on John D. Cook.  ( 5 min )
  • Open

    On Razer’s Edge: VFX Star Surfaced Studio Creates Stunning Sci-Fi World This Week ‘In The NVIDIA Studio’
    Visual effects artist Surfaced Studio returns to 'In the NVIDIA Studio' to share his real-world VFX project, created on a brand new Razer Blade 16 Mercury Edition laptop powered by GeForce RTX 4080 graphics.  ( 8 min )

  • Open

    [P] Traffic signs in eCognition Developer
    Hi community, first time posting here. I'm working on a project for the segmentation and classification of traffic signs using eCognition Developer software. I need help with creating scripts to apply three classifiers: Naive Bayes, SVM, and Random Forest. I'd like to know how I can implement these classifiers in eCognition Developer and where to insert the scripts in the software. Does anyone have experience with this software and could share script examples or provide guidance on how to accomplish this task? Sorry, English is not my first language. TL;DR: I need to include the Bayes, Random Tree, and SVM classifiers in eCognition Developer (for segmentation and classification/prediction). submitted by /u/Dignai [link] [comments]  ( 9 min )
    [P] Using gpt4docstrings to generate docstrings for entire projects
    gpt4docstrings is a Python library that allows you to write docstrings for functions / classes non documented in your codebase. In this case, I'm applying the library to one module of langchain to see the results. Repo: https://github.com/MichaelisTrofficus/gpt4docstrings https://i.redd.it/78f3wit071wb1.gif submitted by /u/Hefty-Consequence443 [link] [comments]  ( 9 min )
    [P] DQN with a binary vector as output
    Hey everyone! I hope you're doing well. I need your help. I'm working on a DQN that outputs a binary vector of length L (I apply a sigmoid function to the output layer and take p > 0.5 as 1 and 0 otherwise). In this setting, at each decision time, the agent returns a list containing the indices of the selected elements. Given that the list's length is dynamic, how can I train my DQN? (I am facing issues with this.) Is there an alternative way to do this (like DDPG :/ )? submitted by /u/GuavaAgreeable208 [link] [comments]  ( 9 min )
    [Project] Looking for AI/ML engineers to team up for a fallow deer identification project
    Hi, first of all, sorry for the cross post, but I guess Huggingface forums were not the right place to begin with and it took me a while to find out where things about AI/ML are being actively discussed. I am a professional software developer (C, Python on Linux) and while I did try out a few things with PyTorch and Diffusers - I am not an ML engineer, so I am looking for someone with ML expertise who’d be interested to team up for a non commercial open source project. I can do quite a lot around application development, but I clearly lack the required ML knowledge. I followed the free MIT ML courses on YouTube, did some reading, tried things out, but the ML part of this project is for sure over my head. So, here’s what I have in mind: I would like to create an application which would b…  ( 11 min )
    [D] Using SQL to monitor ML models
    Hello, We are running a number of machine learning models in production and would like to monitor some metrics during inference: Data quality, inference time, accuracy, etc. All these metrics could be recorded in the python code and we are planning to build a SQL database that will receive all the information so as we can visualize in grafana. Do you think this is a good pattern? What would you suggest instead (we are using AWS). Thank you in advance. submitted by /u/Eddas123 [link] [comments]  ( 9 min )
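The pattern described in this post (record metrics in the Python inference code and write them to a SQL database that a dashboard can query) can be sketched as follows. This is a minimal illustration using the standard-library `sqlite3` module as a stand-in for whatever database would actually run on AWS; the table name, column names, and helper functions are assumptions made for this example, not part of the original post.

```python
import sqlite3
import time

# Hypothetical schema: one row per inference call.
SCHEMA = """
CREATE TABLE IF NOT EXISTS inference_metrics (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    model_name TEXT NOT NULL,
    recorded_at REAL NOT NULL,
    latency_ms REAL NOT NULL,
    missing_feature_count INTEGER NOT NULL,
    prediction REAL
)
"""


def open_metrics_db(path=":memory:"):
    """Open (or create) the metrics database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def log_inference(conn, model_name, latency_ms, missing_feature_count, prediction):
    """Insert one row describing a single inference call."""
    conn.execute(
        "INSERT INTO inference_metrics "
        "(model_name, recorded_at, latency_ms, missing_feature_count, prediction) "
        "VALUES (?, ?, ?, ?, ?)",
        (model_name, time.time(), latency_ms, missing_feature_count, prediction),
    )
    conn.commit()


def average_latency(conn, model_name):
    """Example of the kind of aggregate query a dashboard panel could run."""
    row = conn.execute(
        "SELECT AVG(latency_ms) FROM inference_metrics WHERE model_name = ?",
        (model_name,),
    ).fetchone()
    return row[0]
```

Accuracy usually cannot be logged at inference time because labels arrive later, so one option is to store the prediction with an identifier and join it against ground truth once that becomes available.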
    [R][P] Trying to understand the generative properties of autoencoders
    A while back, I came across the paper "From Variational to Deterministic Autoencoders", which provided a novel insight into the generative properties of autoencoders by framing the objective through the lens of regularization. However, I couldn't help but notice that the deterministic models studied felt incomplete, namely due to the inherent lack of sampling in those models (which is something that the authors acknowledge). To provide a short recap of the paper, the authors surgically decompose the variational autoencoder objective into a deterministic one. They start with a Constant-Variance VAE, which is a special case of the general Gaussian latent VAE where the noise standard deviation of the latent distribution is fixed to 1. This leads to what is essentially a standard autoencoder with t…  ( 10 min )
    [D] What is the lowest possible loss for a language model?
    Example: Suppose a character-level language model (three input letters to predict the next one), trained on a dataset that contains three instances of the sequence aei, with two occurrences preceding o and one preceding u, i.e., the dataset (input → output) is: aei → o, aei → u, aei → o. In this case, the ideal probability distribution for the model's output for aei would be ~0.66 for o, ~0.33 for u, and zero for other letters. In other words, when the model is given aei as input, the ideal softmax of the logits would be ~0.66 for o, ~0.33 for u, and zero for other letters. Following this reasoning, the objective is to optimize the model's output for a given input to match the distribution of occurrences in the dataset. If this reasoning is correct, then we have the following ideal loss (cross-entropy): https://preview.redd.it/pzpxogcqd0wb1.png?width=330&format=png&auto=webp&s=b0b6c3b5fbfb4797c11a1f26375065ce883551d3 Thus, ~0.63 is the smallest loss we can get with this dataset. Is my reasoning correct? submitted by /u/viniciusarruda [link] [comments]  ( 9 min )
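The reasoning in this post can be checked numerically. The sketch below assumes the loss is the average cross-entropy in nats (natural logarithm) and that the best a model can do is predict the empirical next-character distribution for each context; the function name `min_cross_entropy` is made up for this illustration.

```python
from collections import Counter
from math import log


def min_cross_entropy(pairs):
    """Lowest achievable average cross-entropy (in nats) on (context, next_char) pairs.

    The minimum is reached when the model predicts, for every context,
    the empirical distribution of the characters that follow it.
    """
    context_counts = Counter(context for context, _ in pairs)
    pair_counts = Counter(pairs)
    total_loss = 0.0
    for (context, _next_char), count in pair_counts.items():
        ideal_probability = count / context_counts[context]
        # Each of the `count` occurrences contributes -log(ideal_probability).
        total_loss -= count * log(ideal_probability)
    return total_loss / len(pairs)


dataset = [("aei", "o"), ("aei", "u"), ("aei", "o")]
floor = min_cross_entropy(dataset)
```

For the three-example dataset this evaluates to about 0.6365 nats, in line with the post's figure of roughly 0.63. In general this floor is the conditional entropy of the next character given the context, measured on the training data.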
    [D] Tanh activation function outputs the same value for any given input
    Basically I'm working on the DDPG algorithm in DRL, where I have actor and critic networks. The actor network architecture is quite simple: an input layer of 22 neurons representing the state values (ranging from 0.1 to 10.0 at most, not normalized); two hidden layers of 128 neurons each, with Leaky ReLU activation (alpha = 0.01) and the HeUniform kernel initializer; and an output layer with a single tanh neuron, using the Glorot kernel initializer. The critic network has the same architecture, except that we concatenate the 22 state values with the action produced by the actor, and the output of the critic has no activation. Both networks use Adam. I run a few steps without actually starting the learning, but once learning starts, the actor quickly converges to outputting 1 or -1 for any given input. I tried many learning rates for both actor and critic. One thing to note is that when I set the actor learning rate to 1e-5 and the critic's to 1e-3, the networks sometimes converge quickly, sometimes take longer to converge, and sometimes do not converge. submitted by /u/Desert_champion [link] [comments]
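One possible explanation for an actor stuck at 1 or -1 is tanh saturation: once the pre-activation of the output neuron is large in magnitude, the gradient through tanh is almost zero and the output barely moves. The sketch below only illustrates that effect; whether it is the actual cause in this setup is an assumption, and `tanh_grad` is a name made up for this example.

```python
from math import tanh


def tanh_grad(pre_activation: float) -> float:
    """Derivative of tanh at the given pre-activation: 1 - tanh(z)^2."""
    return 1.0 - tanh(pre_activation) ** 2


# The gradient shrinks rapidly as the pre-activation grows in magnitude.
gradients = {z: tanh_grad(z) for z in (0.0, 1.0, 3.0, 10.0)}
```

With unnormalized inputs of up to 10.0 feeding two 128-unit layers, large pre-activations at the output are plausible, so normalizing the state values or penalizing large pre-activations are things worth trying.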
    [P] Fine-tuning VAEs on limited data
    I have been looking for a pre-trained VAE (on ImageNet with ResNet/VGG) or similar which I could fine-tune on my smaller dataset. However, not only are there few such pre-trained weights, but the practice of fine-tuning VAEs does not really seem mainstream. Is there a reason why VAEs are not pre-trained/fine-tuned? Does it have to do with posterior collapse? submitted by /u/unholy_sanchit [link] [comments]  ( 9 min )
    [D] Smart pooling for Visual Transformers
    There is an architecture for images/videos called MViT, where 2D MaxPooling layers are added to reduce computations for ViT. But MaxPooling has a drawback - it discards information independently of context, equally discarding information from both important and uninformative parts of the image. For traditional Conv2D networks, there's little we can do about this, but for transformers, we can reduce dimensionality in a more meaningful way - discarding only those elements that don't carry unique information. Are there any articles/developments on this topic already? submitted by /u/Dependent_Bluejay_45 [link] [comments]  ( 9 min )
    "[Research]" RVC AI Training
    Hello, I'm currently using RVC AI, and I'm about to record myself for the training. What is the best way to record myself, other than singing and talking for at least 15 minutes as the guide says? Should I make one 20-minute audio file, or split the 20 minutes into, say, 10 files of 2 minutes each? Also, can I duplicate my files to reach the required 15-20 minutes of audio, or do I have to record different talking or singing for every file? submitted by /u/WeldFrenzy [link] [comments]  ( 9 min )
    [D] RAG oriented fine-tune... Searching for coherence
    Still searching for a model that is good enough for RAG... There are lots of good models on Hugging Face, but none of them is trained to return extracted text or answers based on provided info without hallucinating something. It is quite frustrating: every week a new version of a model comes out that is amazing for role play and storytelling... (some good progress also on coding...) I see lots of effort going into different RAG strategies, improving semantic search and chunking, but the open source community still does not have a decent model fine-tuned for that. I have considered doing that fine-tune myself, based on synthetic data (using Wikipedia as the knowledge base), but unfortunately I don't have enough funds to cover the API cost or to pay for a decent GPU. I'm not going to train a 7B model, because anything under 30B imho doesn't make much sense if coherence is the main requirement. Unpopular opinion: in terms of coherence, Code Llama 34B is much better than any of the 70B fine-tunes. Sorry to everyone for the rant... Does anyone have some tips or suggestions? Thanks in advance! Edit: My database is composed mainly of abstracts of papers and medical textbooks. I admit that the domain is quite complex, but the error rate is too high, even when the model is prompted to avoid hallucinating (I tried and refined multiple prompts, using different prompt formats). GPT-3.5, Claude Instant and PaLM 2 Bison work fine for that task (obviously GPT-4 and Claude 2 would be best, but they are too expensive for me). I spent lots of time building a solid embedding pipeline: advanced chunking, metadata added by an LLM, text for similarity search different from the text provided to the LLM, an instructor bi-encoder to generate embeddings (INSTRUCTOR-XL), reranking using a cross-encoder, RAG-Fusion using multiple queries and the HyDE approach, and hybrid search with BM25. So... I'm a bit frustrated that I cannot run it all locally, because that is a must for my project. submitted by /u/Distinct-Target7503 [link] [comments]  ( 10 min )
    [R] 2x the context length of ALiBi through position interpolation
    https://arxiv.org/abs/2310.13017# Linear position interpolation helps pre-trained models using rotary position embeddings (RoPE) to extrapolate to longer sequence lengths. We propose using linear position interpolation to extend the extrapolation range of models using Attention with Linear Biases (ALiBi). We find position interpolation significantly improves extrapolation capability on upstream language modelling and downstream summarization and retrieval tasks. submitted by /u/jwan584 [link] [comments]  ( 9 min )
    [D] How to make research publication more reproducible?
    As context, I'm personally working on a project to make ML/AI research publication more reproducible. We're backed by Balaji Srinivasan (https://twitter.com/balajis) at the level of funding and advice. It seems like, despite attempts like Jupyter Notebooks or sites like Papers with Code, most published research in ML still isn't setup to be easily reproducible. Even companies like Anthropic/OpenAI don't put much of an emphasis on reproducibility, even though it's in their interest to do so to earn public trust. Our current hypothesis is to conceptualize reproducible research as software testing. Specifically we're thinking of building tools that let you internally test the robustness of results, and externally publish them s.t. they're reproducible. You can think of it as continuous integration for reproducible research; e.g. BuildBot for Reproducible Research. One specific idea I have is to build a model evaluation/testing platform that lets you: Internally eval LLM models on open benchmarks (TruthfulQA, AGIEval, etc.) Test robustness of results under different assumptions Externally publish reproducible results I don't have a background in ML research. So I'm looking to get input from research engineers on what challenges/barriers currently exist with model testing and publishing reproducibly — so I thought I'd reach out in this community if anyone's open to that! Let me know if this post doesn't conform to the rules, or if this should go somewhere else. submitted by /u/manveerbasra [link] [comments]  ( 9 min )
    [P] Image Captioning Model
    Hello everyone, I am currently trying to find suitable image captioning and visual question answering models to implement in my project. After a quick Google search I came across BLIP2 from Hugging Face; however, it's a very large model overall, and neither my PC nor Colab could load even its lightest pretrained version. Does anyone know any similar pretrained models for these specific tasks, or any other way to load this kind of large model? (I tried loading it with 8-bit precision, which still failed.) I have 16 GB of RAM, and the task requires image captioning and the ability to ask the model details about the specific image. Any help is greatly appreciated!! submitted by /u/Spitefulsalamander [link] [comments]  ( 9 min )
    [D] Episodic Training vs. Random Sub-Sampling in Few-Shot Learning
    I'm new to few-shot learning and I'm having trouble understanding why prototypical networks use a random sub-sampling approach while the vanilla few-shot learning approach uses episodic training. Doesn't random sub-sampling fail to guarantee that data overlapping won't occur? submitted by /u/The_Aoki_Taki [link] [comments]  ( 9 min )
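To make the overlap question concrete, here is a minimal sketch of sampling one N-way K-shot episode so that support and query examples never overlap within that episode. The function name, argument names, and data layout (a dict mapping class label to a list of examples) are assumptions made for this illustration, not taken from any particular paper or library.

```python
import random


def sample_episode(examples_by_class, n_way, k_shot, n_query, rng):
    """Sample one episode with disjoint support and query sets.

    examples_by_class maps a class label to a list of examples.
    Sampling without replacement inside each class guarantees that no
    example appears in both the support set and the query set.
    """
    classes = rng.sample(sorted(examples_by_class), n_way)
    support, query = [], []
    for label in classes:
        picked = rng.sample(examples_by_class[label], k_shot + n_query)
        support += [(example, label) for example in picked[:k_shot]]
        query += [(example, label) for example in picked[k_shot:]]
    return support, query
```

The disjointness here holds only inside a single episode. The same example can still be drawn again in a later episode, which is usually acceptable as long as the classes used for training episodes are kept separate from those used for test episodes.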
    [D] High-temperature softmax
    I implemented a label propagation algorithm of the kind mainly used in the field of Video Object Segmentation (VOS). Basically I provide the labels for one frame and ask my model (using pre-trained encodings of frames) to do semantic segmentation on all the other frames of a video. I am obtaining consistently better results using a high-temperature softmax when computing the similarity between pixels of different frames. Then the top-k similarities of each pixel (features) are used to propagate the labels from one frame to the next. I will not disclose the dataset I am using, but let's say it is noisy (and also low quality). I want to understand why a high-temperature softmax performs better than a softmax with T = 1 or an extreme T = 0.01. At the moment I get better results with T = 10 or 100, and the trend in my grid search suggests that even higher T could work. I was wondering whether the model can still be considered valid if T is too high. I feel like the model is almost randomly guessing if T is too high, but this apparently enhances performance. Any help is appreciated, as is literature about the topic! I only found one paper (which uses a high-temperature softmax to distill knowledge in a student-teacher network for remote sensing imagery). submitted by /u/darthjeio [link] [comments]  ( 9 min )
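The effect of the temperature can be seen on a toy example. The sketch below assumes the softmax is applied to similarity scores divided by T; the function names and the sample scores are made up for this illustration.

```python
from math import exp, log


def softmax(scores, temperature=1.0):
    """Softmax of scores / temperature, computed in a numerically stable way."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [s / temperature for s in scores]
    largest = max(scaled)  # subtracting the max avoids overflow in exp
    exps = [exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; higher means closer to uniform."""
    return -sum(p * log(p) for p in probs if p > 0)


similarities = [0.9, 0.5, 0.1]
sharp = softmax(similarities, temperature=0.01)   # nearly one-hot
plain = softmax(similarities, temperature=1.0)
flat = softmax(similarities, temperature=100.0)   # nearly uniform
```

As T grows, the weights approach uniform, so the propagated label becomes close to a plain average over the top-k neighbours. The ranking of the similarities still decides which neighbours enter the top-k, so the model is not guessing at random; on noisy features, averaging more evenly may simply be more robust than trusting the single best match.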
    [D] Callbacks in tensorflow v1
    Hi everyone, I have some old code written in tf1. It has not been ported to tf2 or pytorch yet. Does anyone of you have leads on whether one can implement custom callback for tf1 code and if there are any examples on the web? Thanks in advance. submitted by /u/wrik003 [link] [comments]  ( 9 min )
    [N] CAPIVARA: Cost-Efficient Approach for Improving Multilingual CLIP Performance on Low-Resource Languages
    In the tech report of GPT4, an analysis was conducted on the impact of different languages on model performance. These effects are attributed to the amount of data and language characteristics. This also indicates that the model's effectiveness may not meet the expectations of users in different languages. The problem addressed in this paper is of significant importance. https://preview.redd.it/s48419fe9yvb1.jpg?width=2748&format=pjpg&auto=webp&s=ba76f1bd18043c6cb2610ed90f5c41a78b5ccd95 Arxiv: https://arxiv.org/abs/2310.13683v1 Stay updated with AI in a fun-to-listen way. Check out ai-dailynews.com to generate your personalized news podcast🎙. It's one of my open-source projects and takes no charge. submitted by /u/xuying_li [link] [comments]  ( 9 min )
    [N] Neural-Base Music Generation for Intelligence Duplication
    The paper employs a deep learning system to learn from the great composer Beethoven and capture his composition ability in a hash-based knowledge base. This new form of knowledge base provides a reasoning facility to drive the music composition through a novel music generation method. https://preview.redd.it/l9gzcoe38yvb1.png?width=1944&format=png&auto=webp&s=d6c5ca7f8fe434be1187c1f0440c5a94ebfc9b64 Arxiv: https://arxiv.org/abs/2310.13691v1 For more AI updates, check out this AI-generated news podcast🎙 tailored to your preferences(ai-dailynews.com), which is open source and free. submitted by /u/xuying_li [link] [comments]  ( 9 min )
    [D] Biclustering with the same row and column clusters
    The biclustering algorithm partitions the rows and columns of a matrix into clusters so that the variance inside each intersection between row and column clusters is minimized. I want to perform biclustering of a matrix, but additionally enforce that the row and column clusters are the same, i.e. if row i lies inside row-cluster c, then column i must lie in column-cluster c. Rows and columns in the matrix represent the same entities (but the matrix is non-symmetric). The sklearn implementation does not support such a constraint. Are there any algorithms for this at all? submitted by /u/Tomarchelone [link] [comments]  ( 9 min )
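The constraint described here can at least be written down as an objective. The sketch below scores a candidate assignment in which index i gets a single label used for both row i and column i, and sums the squared deviations from the mean inside every (row-cluster, column-cluster) block. It is only a scoring function under that assumption, not an established algorithm, and the name `shared_bicluster_cost` is hypothetical.

```python
def shared_bicluster_cost(matrix, labels):
    """Within-block sum of squared deviations for one shared labelling.

    matrix is a square list of lists; labels[i] is the cluster of entity i,
    applied to both row i and column i.
    """
    n = len(matrix)
    blocks = {}
    for i in range(n):
        for j in range(n):
            # Entry (i, j) belongs to the block (cluster of row i, cluster of column j).
            blocks.setdefault((labels[i], labels[j]), []).append(matrix[i][j])
    cost = 0.0
    for values in blocks.values():
        mean = sum(values) / len(values)
        cost += sum((value - mean) ** 2 for value in values)
    return cost
```

A simple search, such as moving one entity at a time to the cluster that lowers this cost the most, could be built on top of it. That would be a heuristic with no optimality guarantee.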
    [D] Referenceless NLP Evaluation
    Hey all, I'm building this open source project that helps ML engineers evaluate LLM applications (it's like unit testing for LLMs). It works great in development, since users can just write a test_file.py as they normally would in pytest, but as I move on to the next phase I'm thinking about how to bring evaluation to production, especially for metrics such as factual consistency where I need a ground truth. I'm hoping to get some ideas around this. Here's a link to the repo (https://github.com/confident-ai/deepeval) if you want more clarity on what the package looks like, but most importantly any help brainstorming production evaluation will be greatly appreciated. Thank you very very much! submitted by /u/Ok_Constant_9886 [link] [comments]  ( 9 min )
    [D] Is Computer Vision dead? - “Quo Vadis, Computer Vision?”
    At ICCV23, several top-notch researchers shared their insights (in a workshop called “Quo Vadis, Computer Vision?”) on the current state of Computer Vision, especially in light of the meteoric rise of LLMs. Has CV stalled? Is CV dead? For example, MIT professor Bill Freeman has some interesting points on foundation models: “FM aren’t fundamental, therefore not stable”. Jitendra Malik argues “video can describe the world better than text.” submitted by /u/btcmx [link] [comments]  ( 9 min )
    [R] Biologically plausible vision models for classification and grasping tasks
    Hey everyone! I am looking for papers that propose or explore biologically plausible vision models, primarily for tasks like classification and grasping (predicting grasping bounding boxes). By biologically plausible, I mean papers that propose models inspired by the human brain in some way or other. I know convolution is loosely inspired by human cognition, but everything I can find seems to suggest the opposite for ViT-like models. I have come across certain papers like these: - https://arxiv.org/abs/1901.00945 - https://proceedings.neurips.cc/paper/2020/hash/98b17f068d5d9b7668e19fb8ae470841-Abstract.html But I am still looking for more. Any suggestions? submitted by /u/Far_Clothes_5054 [link] [comments]  ( 9 min )
    [D] Understanding the math behind diffusion models
    I was trying to comprehend the math behind this paper: https://arxiv.org/pdf/2006.11239.pdf. You can see in the equation corresponding to the forward diffusion process, at each time step, the image in the previous step is also scaled by sqrt(1-beta_t) while adding noise. It seems like the purpose of this is to maintain a fixed variance (or specifically, unit variance) at each time step. My question is: What is the significance of maintaining unit variance at each time step? Why is this useful? I saw somewhere that this is done to prevent the variance from "exploding." I don't really know what this means. I guess the variance keeps on increasing if the scaling isn't done. But why is this bad? submitted by /u/fallendeviL701b [link] [comments]  ( 9 min )
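The role of the sqrt(1-beta_t) factor can be seen by tracking the variance directly. For a forward step x_t = sqrt(1-beta_t) * x_{t-1} + sqrt(beta_t) * noise with unit-variance noise independent of x_{t-1}, the variance follows Var(x_t) = (1-beta_t) * Var(x_{t-1}) + beta_t. The sketch below iterates that recursion with and without the scaling. The linear schedule from 1e-4 to 0.02 over 1000 steps is the one reported in the linked paper; the function name is made up for this illustration.

```python
def variance_trajectory(betas, initial_variance=1.0, scale_signal=True):
    """Per-step variance of the forward process.

    With scale_signal=True the previous sample is multiplied by sqrt(1 - beta),
    so its variance is multiplied by (1 - beta) before beta is added.
    With scale_signal=False noise is simply added on top of the signal.
    """
    variances = [initial_variance]
    variance = initial_variance
    for beta in betas:
        keep = (1.0 - beta) if scale_signal else 1.0
        variance = keep * variance + beta
        variances.append(variance)
    return variances


steps = 1000
betas = [1e-4 + (0.02 - 1e-4) * t / (steps - 1) for t in range(steps)]

scaled = variance_trajectory(betas, scale_signal=True)
unscaled = variance_trajectory(betas, scale_signal=False)
```

With the scaling, a unit-variance input stays at variance 1 and any other starting variance is pulled toward 1, so the final x_T is close to a standard normal regardless of the data. Without it, the variance grows by beta_t at every step (to roughly 11 here), so the inputs the network sees would change scale with t and the endpoint would not be a fixed, known distribution to sample from.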
    [D] Neural Attention - One simple example that explains everything you need to know
    submitted by /u/AvvYaa [link] [comments]  ( 9 min )
    [D] Has anyone tried deploying FastAPI v2 with a BERT model on the NVIDIA Triton Inference Server?
    I'm not sure how to enable BERT with flash attention during the start-up of the Triton server in order to accelerate inference. Dao (the author of FA) told me he’s never tried. submitted by /u/g14loops [link] [comments]  ( 9 min )
  • Open

    Etsy Taking Stores Down as Its Bot Can't Tell Which Mockups Are Real and Which Ones Are AI Generated
    If you are an Etsy seller or know someone who sells on Etsy, or you went on Etsy and found your favorite store gone, it could be because Etsy's bots are taking down stores after failing to tell which mockup images are real and which are AI generated. To see this for yourself, go on YouTube or social media and look for "etsy mockups news". Etsy has been pretty quiet about this, and as a result Etsy sellers are going crazy, as no one knows why some stores that haven't used AI to create their mockups are being targeted by these bots. This just goes to show how hard it is getting to distinguish between what is real and what is AI generated, and how companies across all industries are having issues adapting to AI technology changes. Thoughts? submitted by /u/fk1220 [link] [comments]
    New data poisoning tool lets artists fight back against generative AI
    Nightshade is a new data poisoning tool that allows artists to fight back against generative AI models. By adding invisible changes to the pixels in their art, artists can cause chaos and unpredictable results in AI models that use their work without permission. The tool, called Nightshade, is intended as a way to fight back against AI companies that use artists’ work to train their models without the creator’s permission. Using it to “poison” this training data could damage future iterations of image-generating AI models, such as DALL-E, Midjourney, and Stable Diffusion, by rendering some of their outputs useless—dogs become cats, cars become cows, and so forth. AI companies such as OpenAI, Meta, Google, and Stability AI are facing a slew of lawsuits from artists who claim that th…
    I would like to upload 100+ one-hour-long podcasts in MP3 and get a 1-page summary of the most important points discussed in each episode — what's the best way to go about doing this?
    ChatGPT and Bard are cool, but I have to manually feed them transcripts generated by Whisper to get summaries. Furthermore, since the length of the transcript is often longer than the maximum character limit(s), I have to add additional prompts in between copying and pasting multipart transcripts. Since these recordings are 10–15 years old, the audio quality isn't the best, but I think it's sufficient to generate transcripts + detect speech, if not, I might need an additional "audio cleaning" step as well. I don't mind paying, and I'm above average in technical ability, so if anyone has any suggestions, I'd love to hear them. Here's what the workflow would look like: INPUT: I will upload a folder containing 100+ MP3 files of podcasts with below-average audio quality. OUTPUT: I would like to get a Google Doc or a Text file with 1-page summaries of the most important points in bullet-point format corresponding to each episode. Each page should be separated by some sort of divider, and the header should contain the filename for reference. Ideally, there should be an existing Jupyter Notebook I could throw in Google Colab and do all of the above in a plug-and-play manner, but if not, I'd love to hear your thoughts. Any tips? Thanks! submitted by /u/aknalid [link] [comments]
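The step that causes the manual copy-and-paste work here, splitting a transcript that exceeds the model's input limit, can be sketched as below. This only covers the chunking; the transcription and summarization calls are left out, and the function name and the character-based limit are assumptions made for this illustration.

```python
def chunk_transcript(text, max_chars):
    """Split a transcript into chunks of at most max_chars characters.

    Splits happen only on whitespace so words stay intact. A single word
    longer than max_chars is kept whole in its own, oversized chunk.
    """
    chunks = []
    current_words = []
    current_length = 0
    for word in text.split():
        # Account for the space that joins this word to the previous one.
        extra = len(word) + (1 if current_words else 0)
        if current_words and current_length + extra > max_chars:
            chunks.append(" ".join(current_words))
            current_words = [word]
            current_length = len(word)
        else:
            current_words.append(word)
            current_length += extra
    if current_words:
        chunks.append(" ".join(current_words))
    return chunks
```

A common way to use this is to summarize each chunk separately and then summarize the concatenated chunk summaries into the final one-page bullet list, writing the source filename as the header of each page.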
    The dilemma of potential AI consciousness isn't going away - in fact, it's right upon us. And we're nowhere near prepared. (MIT Tech Review)
    https://www.technologyreview.com/2023/10/16/1081149/ai-consciousness-conundrum/ "AI consciousness isn’t just a devilishly tricky intellectual puzzle; it’s a morally weighty problem with potentially dire consequences. Fail to identify a conscious AI, and you might unintentionally subjugate, or even torture, a being whose interests ought to matter. Mistake an unconscious AI for a conscious one, and you risk compromising human safety and happiness for the sake of an unthinking, unfeeling hunk of silicon and code. Both mistakes are easy to make." "Every expert has a preferred theory of consciousness, but none treats it as ideology—all of them are eternally alert to the possibility that they have backed the wrong horse." "The trouble with consciousness-­by-committee, though, is that this state of affairs won’t last. According to the authors of the white paper, there are no major technological hurdles in the way of building AI systems that score highly on their consciousness report card. Soon enough, we’ll be dealing with a question straight out of science fiction: What should one do with a potentially conscious machine?" "For his part, Schwitzgebel would rather we steer far clear of the gray zone entirely. But given the magnitude of the uncertainties involved, he admits that this hope is likely unrealistic—especially if conscious AI ends up being profitable. And once we’re in the gray zone—once we need to take seriously the interests of debatably conscious beings—we’ll be navigating even more difficult terrain, contending with moral problems of unprecedented complexity without a clear road map for how to solve them." submitted by /u/kamari2038 [link] [comments]
    The Future of AI Voice Technology
    submitted by /u/Amandacerni [link] [comments]
    UK officials use AI to decide on issues from benefits to marriage licences
    submitted by /u/sky_badger [link] [comments]
    One-Minute Daily AI News 10/22/2023
    A new AI agent Eureka developed by NVIDIA Research that can teach robots complex skills has trained a robotic hand to perform rapid pen-spinning tricks — for the first time as well as a human can.[1] Meta’s Habitat 3.0 simulates real-world environments for intelligent AI robot training.[2] South Korea’s SK telecom Co. will collaborate with Deutsche Telekom AG to jointly develop a telecommunications-specific artificial intelligence (AI) large language model (LLM) as competition intensifies among local telecom companies to expand overseas with their own AI capabilities.[3] Scientists say they have built an artificial intelligence (AI) tool that can successfully identify and confirm supernovas.[4] Sources: [1] https://blogs.nvidia.com/blog/2023/10/20/eureka-robotics-research/ [2] https://siliconangle.com/2023/10/20/metas-habitat-3-0-simulates-real-world-environments-intelligent-ai-robot-training/ [3] https://pulsenews.co.kr/view.php?year=2023&no=810112 [4] https://learningenglish.voanews.com/a/researchers-build-first-tool-to-discover-supernovas/7318435.html submitted by /u/Excellent-Target-847 [link] [comments]
    How To Earn $1M+ By Using AI To Write Books
    I've been using AI for a long time, and it often helps me reduce my work time, but I want to try to earn money with it, so I decided to do some research. I'd like to hear your opinion on my analysis, and maybe this post will help someone start a business with AI. Joe Popelas, a very young entrepreneur, has made over a million dollars within the last year selling AI-generated books online. I was fascinated by how simple yet powerful these tools make it to create a book within a matter of a few hours. Joe Popelas is one of a new breed of AI entrepreneurs who capitalized on the democratization of large language models. Joe's story demonstrates the power of combining human creativity with AI. While AI tools did the heavy lifting for his initial drafts, Joe spent time refining …
  • Open

    From text to dream job: Building an NLP-based job recommender at Talent.com with Amazon SageMaker
    This post is co-authored by Anatoly Khomenko, Machine Learning Engineer, and Abdenour Bezzouh, Chief Technology Officer at Talent.com. Founded in 2011, Talent.com is one of the world’s largest sources of employment. The company combines paid job listings from their clients with public job listings into a single searchable platform. With over 30 million jobs listed […]  ( 12 min )
  • Open

    Street View to the Rescue: Deep Learning Paves the Way to Safer Buildings
    Images such as those in Google Street View are taking on a new purpose in the hands of University of Florida Assistant Professor of Artificial Intelligence Chaofeng Wang. He’s using them, along with deep learning, in a research project to automate the evaluation of urban buildings. The project aims to help governments mitigate natural disaster Read article >  ( 6 min )
  • Open

    How to properly evaluate competitive MARL?
    Hello, everyone! I'm building a MARL agent for a zero-sum game and I'm having a hard time evaluating it. I managed to quickly train it for a simple case and I could manually verify that it was actually learning the optimal decision making because I already know how the game works and, for this simple case, I know that there actually is a mathematically correct way to play it (from both sides) and how it should be played, but that isn't true for most cases (and even if it was, I wouldn't be able to manually verify thousands of games). To complicate things even more, there are billions and billions of possible initial states. For single-agent RL, I could set a reward threshold (if I knew which was the maximum reward possible) or at least I could set a maximum time of "no improvement" but, in a zero-sum game, the sum of the policy rewards is, well, zero. I could think of two solutions: Evaluate convergence to Nash Equilibrium on a subset of the possible initial states, which could be a problem because I'm not sure if the game dynamics guarantee the existence of Nash Equilibria; Evaluate convergence of the winrate of the trained agent against a "hand-crafted" baseline agent, which could be a problem because the quality of this evaluation method could depend on how well I can make this baseline agent (which won't be even close to optimal, otherwise I wouldn't be training an agent). Any thoughts? submitted by /u/victorsevero [link] [comments]
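    For the second option, the win rate against a fixed baseline is an estimate from a finite number of games, so it helps to attach an interval to it before deciding it has converged. Below is a minimal sketch using the Wilson score interval for a binomial proportion; the function name `wilson_interval` is hypothetical, and it assumes every game ends in a win or a loss (draws would need separate handling).

```python
import math
from typing import Tuple


def wilson_interval(wins: int, games: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a win rate (z = 1.96 gives roughly 95% coverage)."""
    if games <= 0:
        raise ValueError("games must be positive")
    p = wins / games
    denom = 1 + z * z / games
    center = (p + z * z / (2 * games)) / denom
    half = z * math.sqrt(p * (1 - p) / games + z * z / (4 * games * games)) / denom
    return center - half, center + half


# Example: compare two checkpoints evaluated on the same number of games.
low_a, high_a = wilson_interval(wins=540, games=1000)
low_b, high_b = wilson_interval(wins=560, games=1000)
# Overlapping intervals suggest the difference may be noise at this sample size.
intervals_overlap = low_b <= high_a and low_a <= high_b
```

    This only measures strength relative to the chosen baseline, which is the limitation the post already points out.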
    Programmatic backdoors: DNNs can use SGD to run arbitrary stateful computation
    submitted by /u/gwern [link] [comments]
    In your opinion, which is the most beautiful form of the Bellman Equation and why?
    Didn't see anything about this kind of post in the rules. I'm asking for a tattoo idea, haha. submitted by /u/victorsevero [link] [comments]
    Inverted pendulum swing-up problem not converging to global optimum using SAC or TD3.
    I am writing a thesis on using RL to solve the inverted pendulum swing-up problem. I have tried TD3, SAC, and TD3-Fork. In my testing, TD3-Fork worked best; I think SAC would also work if I could tune the hyperparameters correctly. I would like a trained agent similar to a converged TD3 agent, one that balances the pole almost indefinitely. I have tried the hyperparameters from the website as well as other hyperparameters, but it has not converged. I am wondering if I am missing something or if there is anything I can do to improve the agent. I have been thinking of using HER instead of FORK. Any help or advice would be appreciated. [image: training reward data] The 'maximum' reward that I could get in the simulation is >880. The reward function that I used is -[cos(theta) + 10(|x| > 0.9) + 10(|theta_dt| > 18)]. However, from the data above it only converges to about 837 max and rarely reaches >900. [image: trained TD3-Fork agent] submitted by /u/YEEETTT0708 [link] [comments]
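    The reward expression in the post can be written out directly, which makes it easier to check what each term contributes. The sketch below implements the formula exactly as stated; the function name, argument names, and default thresholds are taken from or modeled on the post, and the angle convention is an assumption (the post does not say where theta = 0 points).

```python
import math


def swingup_reward(theta: float, x: float, theta_dt: float,
                   x_limit: float = 0.9, theta_dt_limit: float = 18.0,
                   penalty: float = 10.0) -> float:
    """Per-step reward: -[cos(theta) + 10*(|x| > 0.9) + 10*(|theta_dt| > 18)]."""
    cost = math.cos(theta)
    if abs(x) > x_limit:
        cost += penalty          # cart too far from the center of the track
    if abs(theta_dt) > theta_dt_limit:
        cost += penalty          # pole spinning too fast
    return -cost
```

    Under this formula the per-step reward is at most 1, reached when cos(theta) = -1 with neither penalty active, so the achievable episode total depends on episode length and on how many steps the swing-up itself takes.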
    Godot enables me to do pure C# Deep reinforcement learning.
    submitted by /u/Vae94 [link] [comments]
    [R] Demo of “Flow-Lenia: Towards open-ended evolution in cellular automata through mass conservation and parameter localization” (link to paper in the comments)
    submitted by /u/gwern [link] [comments]
  • Open

    Celebrating Kendall Square’s past and shaping its future
    The 15th Kendall Square Association annual meeting explored new and old aspects of the neighborhood.  ( 9 min )
  • Open

    Nonlinear algebra
    What is nonlinear algebra? Negations are tricky. They may be the largest source of bugs in database queries. You have to carefully think about what exactly are you negating. Any time you see “non-” attached to something, you have to ask what the context is in which the negation takes place. For example, if you […] Nonlinear algebra first appeared on John D. Cook.  ( 6 min )
  • Open

    Are Generalized Self-Supervised ViT Models the Image Objective Counterpart of LLMs?
    submitted by /u/No-Platypus4021 [link] [comments]
    Neural Networks: A Deep Dive into AI's Building Blocks
    submitted by /u/Emily-joe [link] [comments]
  • Open

    Abstracts: October 23, 2023
    Today on “Abstracts,” Partner Research Manager Andy Gordon & Senior Researcher Carina Negreanu explore new work introducing co-audit, a term for any tool-assisted experience that helps users of generative AI find and fix mistakes in AI output. The post Abstracts: October 23, 2023 appeared first on Microsoft Research.  ( 16 min )

  • Open

    [P] Having GPT-4 Iterate on Unit Tests like a Human
    Hi r/MachineLearning, My name is William and I’m one of the founders of Sweep. Sweep is an AI junior developer that writes and fixes code by mirroring how a developer works. While building Sweep, we used to use the GitHub API, but we ran into rate limits, so we changed this to clone your repository for the duration of the request. It's now coming full circle. Sweep can now write, run, and debug a failing unit test for the ClonedRepo class! Blog: https://docs.sweep.dev/blogs/ai-unit-tests Video: https://www.youtube.com/watch?v=N9PUxmja9z4 submitted by /u/williamsweep [link] [comments]  ( 9 min )
    [D] Structured learning resources for ML Theory
    So essentially what the title says. I want to truly understand what's happening behind Machine Learning in general and also behind each algorithm specifically (starting from the basics to more advanced things, like Logistic Regression, Decision trees and random forests, Deep Learning, NLP, GANs...). By structured I mean it contains all the pieces ordered and organized, from the same source, so you can actually go from the building blocks up, not just a YouTube channel that uploads interesting videos about different machine learning related topics. Regarding the medium, I don't really mind but I would prefer audiovisual content (YT channel/playlists, Lectures, conferences...) but if you really recommend a specific book or series of books that's also okay. If it has some practical focus to it (to better grasp the theory) that would be great. Also, I would prefer if it goes deep into the details, but not too deep into the specific maths involved, but if that's the case that's also okay. Regarding price, obviously if it's free that would be awesome, but in the range of free to 40€ is fine. Thank you for your recommendations in advance!! submitted by /u/aleradamantis [link] [comments]  ( 9 min )
    URL PHISHING OR BENIGN USING DEEP LEARNING "[Research]", "[R]", "[Project]", "[P]"
    Guys, does anyone have an idea why my model doesn't work? It's like a 50-50 chance of getting it right. I'm getting really frustrated. Here is the code so far: import pandas as pd import torch import torch.nn as nn import torch.optim as optim from torch.utils.data import Dataset, DataLoader from collections import Counter from sklearn.model_selection import train_test_split from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score import re from imblearn.under_sampling import RandomUnderSampler # Loading the data file_path = "C:/Users/alex/Desktop/DATASET/malicious_phish.csv" data = pd.read_csv(file_path) # Filtering data filtered_data = data[data['type'].isin(['phishing', 'benign'])] # Undersampling the majority class rus = RandomUnderSampler(rand…  ( 12 min )
    [D] - Pre-Training a 4bit model (NOT Fine-tuning)
    Pre-Training using 4bit (NOT fine-tuning) Hello community! I have been messing around with open source LLMs, running them locally using peft and AutoGPTQ in Transformers. I even trained a few QLoRA models (my favorite part). However, my question is this: given the performance of a 4bit model, why hasn't there been any research in this area? Is it even possible to create a new model using 4bit altogether? I am sure it's not as easy as it sounds, but I haven't seen anyone try. Just curious, because it would open doors for many of us with consumer-grade hardware. Thanks! submitted by /u/Delicious-Farmer-234 [link] [comments]  ( 9 min )
    [P] Infinity, a FOSS project for supporting RAG for LLMs and Vector Embeddings.
    https://github.com/michaelfeil/infinity Infinity, an open-source REST API for serving vector embeddings, using a torch / ctranslate2 backend. It's under the MIT License, fully tested, and available on GitHub. I am the main author, curious to get your feedback. FYI: A couple of days after me, Huggingface launched a similar project ("text-embeddings-inference") under a non-open-source, non-commercial license. submitted by /u/OrganicMesh [link] [comments]  ( 9 min )
    [R] Combining Thermodynamics and Diffusion Models for Collision-Free Robot Motion Planning
    Researchers from Yonsei University and UC Berkeley recently developed a new AI method for enabling autonomous robots to navigate unfamiliar environments filled with obstacles using only visual data as input. The key innovation is a customized diffusion model. Diffusion models can generate diverse motion plans by adding controlled noise. The researchers tailored the model to mimic how heat avoids insulation when dispersing through space. Similar to heat navigating around insulators, this "collision-avoiding" diffusion model learns to predict robot motions that avoid collisions with obstacles. It generates reachable goals and viable motion plans to those goals simultaneously. In simulations, this approach achieved ~98% success rates in navigating to target destinations while avoiding randomly generated obstacles using only visual map images as input. While extensive real-world testing is still needed (only 2D, only simulation), these initial results showcase promising capabilities: Enables navigation in unfamiliar environments without pre-mapping. Flexibly identifies and progresses toward reachable goals. Avoids unnecessary sensing systems for obstacle avoidance. Learns complex collision avoidance heuristics from visual data. I like the thermo + AI + robotics combination here - takes me back to my days in aerospace engineering. Pretty interesting approach. Full summary is here. Paper here. submitted by /u/Successful-Western27 [link] [comments]  ( 9 min )
    [R] Speeding up open source LLMs with speculative decoding
    submitted by /u/firef1y1 [link] [comments]  ( 9 min )
    [P] Open Source AI repos that caught my 👀 this week
    @MetaGPT_ github.com/geekan/MetaGPT - multi agent collaboration - MetaGPT encodes Standard Operating Procedures (SOPs) into prompts. The claim is that it takes a one line requirement as input and outputs user stories / competitive analysis / requirements / data structures / APIs / documents, etc. @Ollama_ai github.com/jmorganca/olla… - run large language models locally. The future of AI/LLMs may not be on the cloud, but on your own laptops/mobiles. ollama.ai/blog/building-… @huggingface github.com/huggingface/ca… - slick ML framework for Rust with a focus on performance (including GPU support) @remilouf github.com/outlines-dev/o… - helps developers guide text generation to build robust interfaces with external systems. Provides generation methods that guarantee that the output will match a regular expression or follow a JSON schema. github.com/YiVal/YiVal enterprise AI platform submitted by /u/oana77oo [link] [comments]  ( 9 min )
    [D] Simple Questions Thread
    Please post your questions here instead of creating a new thread. Encourage others who create new posts for questions to post here instead! Thread will stay alive until next one so keep posting after the date in the title. Thanks to everyone for answering questions in the previous thread! submitted by /u/AutoModerator [link] [comments]  ( 9 min )
    [D] Teachers struggle to adapt amid AI revolution in education
    submitted by /u/DutchTechJunkie [link] [comments]  ( 9 min )
    [R] Language interaction to assist in composing and refining music
    Hey guys, I found an interesting paper recently. Universities in the UK introduced Loop Copilot, enabling users to generate and iteratively refine music through an interactive, multi-round dialogue interface. Using language interaction to assist in composing music is very appealing, AI makes a complex workflow easy and automated. https://preview.redd.it/nzvlgevcarvb1.jpg?width=998&format=pjpg&auto=webp&s=815e0f48c299831700215ebdb4257423e317f5ec submitted by /u/xuying_li [link] [comments]  ( 9 min )
    [D] How to account for extreme periods in time series forecasting?
    I am performing a (machine learning) time series forecast on monthly data from the last 20 years. If I separate my data into a train, validation, and test set, the validation set is almost completely filled with extreme values due to the Covid period. How to account for this? submitted by /u/Ambitious-Pay6329 [link] [comments]  ( 9 min )
    [P] Graphing emotion events with LMs for in-depth sentiment analysis
    submitted by /u/helliun [link] [comments]  ( 8 min )
    [D] DINOv2 Breakdown: I've Created a Visual Guide to the Model's Design & a Concise Code Walkthrough
    submitted by /u/CkmCpvis [link] [comments]  ( 9 min )
    [R] Do you read ML/DL/AI related scientific papers? How do you filter them?
    As the title says. Recently, I found a review paper where the authors showed an exponential growth of published papers related to ML or DL. I was wondering if you even read those. If yes what's your way to find good and reliable papers? Do you choose only ones with a significant number of citations? Or just strictly related to your field? If no, why not? https://preview.redd.it/jwjvej5f5qvb1.jpg?width=1080&format=pjpg&auto=webp&s=bf3f7e08e0fe09fe0c6a6fd8d194945b45f5858e submitted by /u/hahahaczyk [link] [comments]  ( 9 min )
    [R] Open-Source Projects on Detecting Landmines
    I know that there are a lot of efforts at the moment to improve the algorithms used for landmine detection. Is anyone aware of any ongoing open-source projects in this space? submitted by /u/Eightstream [link] [comments]  ( 9 min )
    [R] Demo of “Flow-Lenia: Towards open-ended evolution in cellular automata through mass conservation and parameter localization” (link to paper in the comments)
    submitted by /u/hardmaru [link] [comments]  ( 9 min )
    German researchers create DeepMB for faster, high-quality optoacoustic imaging [N]
    Researchers from Germany have developed DeepMB, a groundbreaking deep-learning framework enabling high-quality, real-time optoacoustic imaging via multispectral optoacoustic tomography (MSOT). With potentially transformative implications for health care, this innovation might redefine medical imaging standards. DeepMB breakthrough: DeepMB resolves the longstanding tradeoff between image quality and speed in medical imaging. The deep-learning framework uses a deep neural network for model-based reconstruction, allowing for fast, high-quality imaging. DeepMB can reconstruct images approximately 1000 times faster than conventional techniques, with virtually no loss in image quality. Impressive metrics and implications: The researchers accomplished accurate optoacoustic image reconstruction in just 31 milliseconds per image by training the system on synthesized optoacoustic signals paired with ground-truth images. DeepMB promises to equip clinicians with immediate access to high-quality MSOT images, regardless of the patient's condition or scanned body area. The technology could extend to other imaging modalities, such as ultrasound, x-ray, and MRI, potentially changing how diseases are diagnosed and treated. Exciting prospects: The development of DeepMB is a significant leap in optoacoustic imaging, promising to enhance healthcare outcomes. As DeepMB evolves, it could become integral to modern medical imaging, delivering high-quality results at previously unattainable speeds. (source) submitted by /u/orthomax23 [link] [comments]  ( 9 min )
    [D] FourCastNet. Neural PDEs perform global weather simulation 4 to 5 orders of magnitude faster than traditional numerical methods.
    submitted by /u/moschles [link] [comments]  ( 9 min )
    Data labeling service for keypoints / pose [D]
    I was previously using scale.ai but they have been extraordinarily slow. Does anyone have recommendations for services to label keypoints or pose? Bonus points if the labeling service is able to handle 3D / multi angle data coming from multiple cameras. I work in an academic lab and scale is <10k images per batch. submitted by /u/researchrig [link] [comments]  ( 9 min )
    [D] Need help with text-to-song diffusion model architecture
    Hey, I want to make a text-to-song diffusion model, but I can't figure out the architecture. I have already prepared a dataset of about 5000 songs of different genres and artists. It only contains the lyrics, the genre, and the song itself. Do I understand correctly that I should just encode the text and genre into one vector using CLIP and hope that the model will follow it directly (not skipping words and lines), or should I somehow add timestamps to the dataset (when, where, and what text is sung)? I was inspired by Chirp V1. submitted by /u/Head-Selection-9785 [link] [comments]  ( 9 min )
  • Open

    IBM's NorthPole chip runs AI image recognition 22x faster than current chips
    IBM has developed a chip called NorthPole that runs AI-based image recognition 22 times faster than current chips on the market. The chip uses a two-dimensional array of memory blocks and interconnected CPUs to process data quickly. However, it can only run specialized AI processes and not training processes or large language models. The researchers plan to test connecting multiple NorthPole chips together to overcome this limitation. Source : https://techxplore.com/news/2023-10-ibm-northpole-chip-ai-based-image.html submitted by /u/NuseAI [link] [comments]
    Email AI
    Is there a website or some AI to help me clean my inbox, stop receiving emails from certain senders, etc.? I've heard about: Sanebox for keeping your inbox organized; Mailbutler for gathering contact details and tasks; EmailTree for creating AI-powered workflows. But they are paid, and I'm looking for free alternatives. submitted by /u/JOTA-137_0 [link] [comments]
    Microsoft CEO Satya Nadella talks AI, closing the Activision Blizzard deal, and his best business decision so far
    submitted by /u/thisisinsider [link] [comments]
    Medical Student Question: Why aren't there any programs that do differential diagnosis for doctor?
    Based on the input you have. This would be like an enterprise-level software program, I guess: you would input the history, and by trawling through data locally it could generate candidate diseases and the probability that the patient has each one based on the data entered. Why doesn't something like this already exist? I am learning how to do differential diagnosis now, and it seems to use an extremely rudimentary understanding of probability to diagnose things. You use clusters of symptoms and then use tests to eliminate stuff in the differential. It just seems like low-hanging fruit that a program could do using tech we already have (I imagine LLMs will make it easier). submitted by /u/derpgod123 [link] [comments]
    Tried visualizing an entire script using Dall-E 3 and these are the results.
    https://preview.redd.it/vi9wx005ksvb1.jpg?width=1024&format=pjpg&auto=webp&s=75502abcae7f2337693175101cb3491b8647d70d Revived an old script and made some images for it using Dall-E 3, just to test out the workflow: https://docs.google.com/document/d/1yyWRRmd0ah5Z4u8_aNYSq9csJ8pccP24Dcs9brPHbzs/edit Was pretty fun and I think by the end I got much better at learning how to maintain the consistency between characters, direct shots, etc. submitted by /u/Kulimar [link] [comments]
    Combining Thermodynamics and Diffusion Models for Collision-Free Robot Motion Planning
    Researchers from Yonsei University and UC Berkeley recently developed a new AI method for enabling autonomous robots to navigate unfamiliar environments filled with obstacles using only visual data as input. The key innovation is a customized diffusion model. Diffusion models can generate diverse motion plans by adding controlled noise. The researchers tailored the model to mimic how heat avoids insulation when dispersing through space. Similar to heat navigating around insulators, this "collision-avoiding" diffusion model learns to predict robot motions that avoid collisions with obstacles. It generates reachable goals and viable motion plans to those goals simultaneously. In simulations, this approach achieved ~98% success rates in navigating to target destinations while avoiding randomly generated obstacles using only visual map images as input. While extensive real-world testing is still needed (only 2D, only simulation), these initial results showcase promising capabilities: Enables navigation in unfamiliar environments without pre-mapping. Flexibly identifies and progresses toward reachable goals. Avoids unnecessary sensing systems for obstacle avoidance. Learns complex collision avoidance heuristics from visual data. I like the thermo + AI + robotics combination here - takes me back to my days in aerospace engineering. Pretty interesting approach. Full summary is here. Paper here. submitted by /u/Successful-Western27 [link] [comments]
    I upgraded my AI girlfriend… and now she remembers stuff about me..
    submitted by /u/spaceecon [link] [comments]
    Self-learning AI Movement Prediction: Beyond Airstriker Genesis to multi-directional predictions
    Quick update on my self-learning software experiment: Thanks to your feedback, I decided to test my prediction system on a newer tower-defense game from the Apple App Store (simply called ‘The Tower’). What's crucial to remember is that this system is not pre-trained and only learns from the current game it encounters - it starts with zero knowledge and learns exclusively from the game it's currently playing, building from the ground up without the use of deep learning or neural networks. In this game (unlike Airstriker which I’ve previously used), players don't control a spaceship or fire weapons (you play the game by ‘upgrading’ your weapons, etc.). It's simpler because there's only one type of enemy that always approaches the center, so the system cannot demonstrate its capabilities for differentiation in this case. But this simplicity presents some other interesting challenges: Enemies approach from all 360-degree directions, pushing the boundaries of the path prediction software. They overlap during explosions, demanding the system to separate them. There's also more visual clutter, including static lines and a non-black background. The system's predictive performance has been remarkably strong. I’ve put together an overlay video to visually demonstrate how the system learns and adapts in this new game. Note: If things don’t align perfectly in there, it’s due to my poor video editing skills… Your feedback is appreciated as always! submitted by /u/_timmah_ [link] [comments]
    A scary thought...
    Without us, artificial intelligence just becomes intelligence submitted by /u/cognaceast [link] [comments]
    AI RPG DALL-E 3
    submitted by /u/the_anonymizer [link] [comments]
    Could machine learning produce a "simple" AI algorithm that performs better than what a human programmer could create in a reasonable amount of time?
    Let me clarify what I'm asking through an example: Artificial Intelligence in videogames has failed to develop in any meaningful way over the past two decades, at least as far as the typical end-user is concerned, and nowhere is this more apparent than in strategy games. Whether we're talking about the 90's or today, AI opponents typically have to receive significant cheats in order to provide a challenging experience for the player. This is widely considered undesirable, can harm immersion or a sense of fair-play, and leads to the concept of "cheesing" the AI (exploiting obvious weaknesses in the AI logic, something which is sometimes necessary if an AI receives such strong bonuses that any strategy you might attempt against another human player would be impossible to execute successfull…
  • Open

    Stability of a superegg
    Three weeks ago I wrote about supereggs, a shape popularized by Piet Hein. One aspect of supereggs that I did not address is their stability. Looking at the photo above, you could imagine that if you gave the object a slight nudge it would not fall over. Your intuition would be right: supereggs are stable. […] Stability of a superegg first appeared on John D. Cook.  ( 6 min )
    Best-of-five versus Best-of-seven
    Suppose that when Team X and Team Y play, the probability that X will win a single game is p and the probability that Y will win is q = 1 − p. What is the probability that X will win the majority of a series of N games for some odd number N? We know intuitively […] Best-of-five versus Best-of-seven first appeared on John D. Cook.  ( 5 min )
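    The quantity in question can be computed directly as a binomial tail: X wins the series when it wins more than half of the N games. The sketch below assumes independent games with a fixed p and treats the series as if all N games were played, which gives the same series-winner probability as stopping once one team has clinched. The function name `series_win_probability` is only illustrative.

```python
from math import comb


def series_win_probability(p: float, n: int) -> float:
    """Probability that a team with single-game win probability p wins a best-of-n series."""
    if n < 1 or n % 2 == 0:
        raise ValueError("n must be a positive odd number")
    q = 1 - p
    needed = n // 2 + 1  # wins required for a majority
    return sum(comb(n, k) * p**k * q**(n - k) for k in range(needed, n + 1))


# With p = 0.6, the longer series favors the stronger team slightly more.
best_of_five = series_win_probability(0.6, 5)
best_of_seven = series_win_probability(0.6, 7)
```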
  • Open

    How the Self Play algorithm masters Multi-Agent AI
    submitted by /u/AvvYaa [link] [comments]
    Mujoco RL Robotic Arm
    Hi everyone, I'm new to robotic arms and I want to learn more about how to implement them using mujoco env. I'm looking for some open-source projects on github that I can run and understand. I tried MuJoCo_RL_UR5 repo but it didn't work well for me, it only deployed a random agent. Do you have any recommendations for good repos that are beginner-friendly and well-documented? submitted by /u/satyamstar [link] [comments]
    Why does the Bellman equation converge?
    After multiple iterations, the value function converges under Bellman updates (the value iteration algorithm). Can someone provide an intuitive explanation of why the value converges? submitted by /u/RaceCondition01 [link] [comments]
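    One standard intuition is that, for a discount factor gamma < 1, the Bellman optimality update is a contraction in the max norm: applying it to any two value functions brings them closer by at least a factor of gamma, so repeated updates squeeze every starting point toward the same fixed point. The sketch below checks this numerically on a small made-up MDP; the transition table `P`, the rewards, and the starting values are all hypothetical and chosen only for illustration.

```python
GAMMA = 0.9

# P[state][action] = list of (probability, next_state, reward); a made-up 2-state MDP.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.5, 0, 1.0), (0.5, 1, 1.0)]},
    1: {0: [(1.0, 0, 2.0)], 1: [(1.0, 1, 0.5)]},
}


def bellman_update(v):
    """One sweep of value iteration: v(s) <- max_a sum_s' p * (r + gamma * v(s'))."""
    return [
        max(
            sum(p * (r + GAMMA * v[s2]) for p, s2, r in P[s][a])
            for a in P[s]
        )
        for s in sorted(P)
    ]


def sup_distance(u, v):
    """Largest absolute difference between two value functions."""
    return max(abs(a - b) for a, b in zip(u, v))


# Start from two very different guesses and track how far apart they remain.
u, v = [0.0, 0.0], [100.0, -50.0]
gaps = []
for _ in range(20):
    gaps.append(sup_distance(u, v))
    u, v = bellman_update(u), bellman_update(v)
```

    Each entry of `gaps` should be at most GAMMA times the previous one, which is the contraction property at work; the initial guess stops mattering at a geometric rate.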
  • Open

    Replay game input with image classification
    TensorFlow Keras correcting camera horizon in AC Valhalla https://www.youtube.com/watch?v=ASy-2zOMj_Y submitted by /u/Kostiantyn-Dvornik [link] [comments]
    How do I determine the number of layers and the number of neurons in each?
    Hello, I’m a newbie in neural networks and I wonder: how do people decide how many hidden layers there will be and how many neurons will be in each? What's the logic behind it? submitted by /u/Particular-Song-633 [link] [comments]
    Unboxing Neuro Symbolic Reasoning and Learning
    submitted by /u/Neurosymbolic [link] [comments]

  • Open

    Google, other search engines' use of generative AI threatens $68B SEO industry
    The rise of generative AI in search engines like Google threatens the $68 billion search engine optimization (SEO) industry. Generative AI tools like ChatGPT aim to provide direct answers to user queries, bypassing the need for users to click on search results. This could render SEO efforts useless and impact the revenues of SEO consultants and search engines. However, generative AI search engines still face challenges such as providing incorrect or plagiarized answers, and gaining user trust and loyalty. Search engines have been quick to experiment with generative AI to improve search results, with Google's Bard, Microsoft's Bing AI, Baidu's ERNIE, and DuckDuckGo's DuckAssist being examples of this approach. As the quality of AI-generated answers improves, users will have less incentive to browse through search result listings, impacting the revenues of SEO consultants and search engines. The SEO industry generated $68.1 billion globally in 2022 and was expected to reach $129.6 billion by 2030, but the emergence of generative AI puts the industry at risk of obsolescence. Generative AI search engines are still in their infancy and face challenges such as providing incorrect or plagiarized answers, limiting their trust and loyalty among users. However, with the resources available to researchers, it is safe to assume that generative AI models will improve over time, leading to the potential death of the SEO industry. Source : https://theconversation.com/why-google-bing-and-other-search-engines-embrace-of-generative-ai-threatens-68-billion-seo-industry-210243 submitted by /u/NuseAI [link] [comments]
    Experimented with Fully Automating TikTok Video Creation Using AI for a Month - Here's What I Learned
    Hi everyone, I recently undertook a personal project where I tried to automate the entire process of creating TikTok videos using various AI tools. The goal was to see how far we've come in terms of AI's capabilities in content creation and to explore the nuances of automating a traditionally 'human' task. Here's a brief breakdown. Scripting: leveraged ChatGPT for generating video scripts. Voiceovers: used ElevenLabs for lifelike voice narration. Video creation: employed a combination of StableDiffusion Animate & Replicate. Editing: automated the editing process to sync with the AI-generated voiceovers. After setting everything up, I ran the system for a month, generating 3 videos daily. The results were intriguing, a mix of expected and unexpected outcomes. Would love to hear thoughts, feedback, or similar experiences from the community. Are there other creative ways you've seen or used AI in content creation? submitted by /u/General_crypto [link] [comments]
    AI RPG (Dall-E 3)
    submitted by /u/the_anonymizer [link] [comments]
    Thanks to AI, the future of programming may involve YELLING IN ALL CAPS
    The future of programming may involve human-like communication techniques, including yelling in all caps. OpenAI's DALL-E 3 AI image generator integrated into ChatGPT revealed internal prompts shared between the image generator and the AI assistant. The prompts included commands written in all-caps for emphasis. This shows that programming and communicating with computers may become more human-like in the future. Previously, programs used specialized data formats and APIs to communicate, but now large language models allow for cross-program interaction in conventional English. OpenAI trained GPT-4, the AI model used in ChatGPT DALL-E interface, on hundreds of millions of documents scraped from the web, which included instances of polite language and reactions to it. The use of all-caps in the DALL-E message is interpreted as emphasis, and the model pays more attention to capitalized sentences. In the future, programming and communicating with computers may involve more emphasis and human-like communication techniques. Source : https://arstechnica.com/information-technology/2023/10/thanks-to-ai-the-future-of-programming-may-involve-yelling-in-all-caps/ submitted by /u/NuseAI [link] [comments]
    Close up view of rain hitting dust.
    submitted by /u/IllustriousVideo6145 [link] [comments]
    Impressive
    submitted by /u/the_anonymizer [link] [comments]
    Singularity Pinball.
    submitted by /u/Philipp [link] [comments]
    One-Minute Daily AI News 10/21/2023
    This dating app SciMatch uses AI to find your soulmate by your face. Snap a selfie, and let the app do the rest.[1] The Biden administration is reducing the types of semiconductors that American companies will be able to sell to China, citing the desire to close loopholes in existing regulations announced last year.[2] Business Schools Are Adding AI Education Into The Curriculum.[3] Google Pixel’s face-altering photo tool sparks AI manipulation debate.[4] Sources: [1] https://www.foxnews.com/tech/dating-app-uses-ai-find-soul-mate-face [2] https://www.cnn.com/2023/10/18/tech/us-china-chip-export-curbs-intl-hnk/index.html [3] https://www.entrepreneur.com/business-news/business-schools-are-adding-ai-education-for-future-ceos/464054 [4] https://www.bbc.com/news/technology-67170014 submitted by /u/Excellent-Target-847 [link] [comments]
    Training AI to Play Pokemon with Reinforcement Learning
    submitted by /u/ShooBum-T [link] [comments]
    ChatGPT and Bard cannot solve every problem for you.
    My last post in this thread got almost 90k views, honestly I'm very happy that I was able to be so helpful. One guy asked me why I couldn't give more details about which tools I use and which ones help me, so I decided to put together a top 24 and describe what each tool is for in 2 words. So as not to violate the rules of r/artificial (some tools can be paid), I decided not to leave direct links to tools; I left only links to the 2 resources where I took this information, which fortunately are free. YouTube Summaries → http://eightify.app 3D Animations → http://moviebot.io AI Assistant → http://zipzap.ai Prompts → http://wnr.ai How-to-videos → http://teachomatic.net Custom AI chatbots ➝ http://chatling.ai Remove Background ➝ http://unscreen.com Forms ➝ http://feathery.io Presentations ➝ http://beautiful.ai Learning ➝ http://albus.org Blog ➝ http://jasper.ai Videos ➝ http://descript.com Image ➝ http://tryleap.ai Resume ➝ http://mosaicml.com Grammar Check ➝ http://trinka.ai Meeting ➝ http://krisp.ai Video ➝ http://decoherence.co App development ➝ http://brancher.ai Design ➝ http://modiphy.com Coding assistant ➝ http://bito.ai Twitter assistant ➝ http://tweethunter.io Personal assistant ➝ http://chat.openai.com LinkedIn assistant ➝ http://taplio.com YouTube assistant ➝ http://vidiq.com I hope this is as useful to you as the first post. I'm just sharing my experiences and observations in the field of AI. LIST AND SITE https://preview.redd.it/zgkra3plpgvb1.jpg?width=1068&format=pjpg&auto=webp&s=779003d65dfa70c58d50ad690a0e436c735cdaeb submitted by /u/PerceptionPlayful469 [link] [comments]
    Oracle loops in Nvidia's AI stack for end-to-end model development
    Oracle has partnered with Nvidia to bring Nvidia's AI stack to its marketplace, giving Oracle customers access to top-of-the-line GPUs for training models and building generative applications. Eligible enterprises can purchase Nvidia's DGX Cloud AI supercomputing platform and AI Enterprise software directly from the marketplace and start training models for deployment on the Oracle Cloud Infrastructure. Nvidia DGX Cloud offers a serverless experience for multi-node training of custom generative AI models, supporting near-limitless scale of GPU resources. Nvidia AI Enterprise helps teams accelerate the deployment of models to production, with features such as the Nvidia NeMo framework, Rapids, TensorRT LLM open-source library, and Triton Inference server. Oracle has been focused on industry partnerships for its AI efforts and has announced generative AI capabilities in its products and solutions. Source : https://venturebeat.com/ai/oracle-loops-in-nvidias-ai-stack-for-end-to-end-model-development/ submitted by /u/NuseAI [link] [comments]
  • Open

    Policy Evaluation
    I know that given a policy, I can find the value function using iterative policy evaluation. Can I, given the value function, find the policy? submitted by /u/MomoSolar [link] [comments]
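    One hedged way to read the question above: a value function alone does not determine a policy, but together with a model of the environment it yields a greedy policy via a one-step lookahead. The sketch below assumes a small tabular MDP whose transition tensor `P` and reward table `R` are known; the names `greedy_policy`, `P`, `R` and the toy numbers are illustrative assumptions, not part of the original post.

```python
import numpy as np

def greedy_policy(V, P, R, gamma=0.9):
    """One-step lookahead: pick, in every state, the action with the best Q-value.

    V: (S,) state values
    P: (S, A, S) transition probabilities P[s, a, s']  (assumed known)
    R: (S, A) expected immediate rewards               (assumed known)
    """
    Q = R + gamma * (P @ V)   # (S, A): Q[s, a] = R[s, a] + gamma * sum_s' P[s, a, s'] V[s']
    return Q.argmax(axis=1)   # (S,) greedy action per state

# Toy 2-state, 2-action MDP (hypothetical): action 0 stays put, action 1 switches state.
P = np.zeros((2, 2, 2))
P[0, 0, 0] = P[1, 0, 1] = 1.0   # action 0: stay
P[0, 1, 1] = P[1, 1, 0] = 1.0   # action 1: switch
R = np.zeros((2, 2))
V = np.array([0.0, 10.0])       # state 1 is valued higher than state 0

policy = greedy_policy(V, P, R)
```

    If `V` is the optimal value function, the greedy policy is optimal; if `V` belongs to some other policy, the result is generally an improved policy rather than the one `V` was computed for. Without `P` and `R` (or action values Q), the policy cannot be recovered from `V` alone.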
    Question on advantage (re-)computation for PPO
    Hi, I've been re-reading the "What matters in on-policy reinforcement learning" paper (https://arxiv.org/abs/2006.05990), and noticed that they suggest recomputing advantages at the beginning of each epoch (choice C5, see section 3.5 and appendix B.1). I was wondering: (1) has anyone here already tried this and seen a significant improvement (which is what the paper suggests)? (2) Doesn't this also imply recomputing the value targets at the beginning of each epoch, which could lead to some sort of moving-target issue? Best, submitted by /u/Scrimbibete [link] [comments]
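    To make the question concrete, here is a minimal sketch of generalized advantage estimation (GAE) written so it can be called again at the start of every epoch with fresh value predictions. It is not the paper's code; it assumes a single rollout segment with no episode boundary inside it, and the function name `gae` and its arguments are illustrative.

```python
import numpy as np

def gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized advantage estimation over one rollout segment.

    rewards:    r_0 .. r_{T-1}
    values:     V(s_0) .. V(s_{T-1}) from the *current* value network
    last_value: V(s_T), the bootstrap value after the last step
    Assumes no episode termination inside the segment.
    """
    T = len(rewards)
    adv = np.zeros(T)
    next_value, running = last_value, 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * next_value - values[t]   # TD error at step t
        running = delta + gamma * lam * running               # discounted sum of TD errors
        adv[t] = running
        next_value = values[t]
    value_targets = adv + np.asarray(values, dtype=float)     # lambda-returns
    return adv, value_targets

# Sanity checks on toy numbers.
adv_mc, targets_mc = gae([1.0, 1.0, 1.0], [0.0, 0.0, 0.0], 0.0, gamma=1.0, lam=1.0)
adv_td, _ = gae([1.0, 1.0], [0.5, 0.5], 0.0, gamma=1.0, lam=0.0)
```

    Recomputing per epoch would mean re-running the value network on the stored observations and calling `gae` again. The function returns both advantages and value targets, so a training loop could refresh only the advantages and keep the first epoch's targets fixed, which is one way to test the moving-target concern raised in the post.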
    In RL, how can we reward an action taken 5 steps ago?
    Let us say we are building a model that will learn how to play a computer game like DOTA or League of Legends. If the model, for example, buys weapon A and uses the item's ability on opponent B, it should learn how much damage it deals given the items opponent B is wearing. But the model will have taken many other actions between buying the weapon and being able to use it, so the reward for what it does / how much damage it made arrives late. How do you assign a delayed reward to a specific action made X steps ago? Thank you. submitted by /u/oniongarlic88 [link] [comments]
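    A common answer to this kind of question is that the reward is not attached to the old action directly; instead each step is credited with the discounted sum of all rewards that follow it. The sketch below shows that computation on a made-up six-step episode in which only the last step is rewarded; the function name and the numbers are illustrative assumptions.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for every step t."""
    G, out = 0.0, []
    for r in reversed(rewards):   # walk backwards so each step reuses the later return
        G = r + gamma * G
        out.append(G)
    return out[::-1]

# Hypothetical episode: the weapon is bought at step 0, the damage reward (10) arrives at step 5.
returns = discounted_returns([0, 0, 0, 0, 0, 10], gamma=0.5)
```

    With `gamma=0.5` the purchase at step 0 receives a return of `10 * 0.5**5`, so it still gets credit, just less than the actions closer to the reward. Value-based and actor-critic methods estimate this same quantity with a learned value function rather than waiting for the episode to end.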
    Zoomposium with Professor Dr. Petra Ritter: "The simulation of brains"
    In another installment in our "Zoomposium Series" on the topic of "Brain Research", my colleague Axel Stöcker of the "Blog der großen Fragen" and I had the great honor and pleasure of conducting an interview with the very well-known and renowned German medical doctor and neuroscientist Professor Dr. Petra Ritter. In this context, Ms. Ritter became a co-founder and leader of the co-design project "The Virtual Brain", which is a component of the European Open Science Cloud (EOSC) and is "a neuroinformatics platform for simulating whole brain networks using biologically realistic connectivity". She is leading the development of a virtual research environment as a collaborative research platform for sensitive health data, is head of the "German National Neuroscience Research Infrastructure Initiative (NFDI-Neuroscience)", and is involved in the development of the "Health Data Cloud EBRAINS". Petra Ritter has been Johanna Quandt Professor and Head of the Section for Brain Simulation at the Department of Neurology with Experimental Neurology at Charité - Universitätsmedizin Berlin since 2017. There, Professor Ritter and her team are involved in the "Simulation of Brains". More at: https://philosophies.de/index.php/2023/09/17/die-simulation-von-gehirnen/ https://preview.redd.it/937m7mtyvivb1.jpg?width=1000&format=pjpg&auto=webp&s=22d1a7576f2ebbe7904f0187bd7c0234df7ddb8f submitted by /u/philosophiesde [link] [comments]
  • Open

    [D] PRML reading buddy
    Hey there mates, I am a 3rd year PhD student, trying to break into good quality research (tired of trying different permutations and combinations of Xs and Ys, and hitting dead ends when things don't work, or worse -- being unable to explain why things work :D). I have recently decided to read PRML cover to cover (slowly) and do some of the exercises as well. The goal is to finish in 6 months (2 chapters per month). Is there anyone on a similar journey? I would love to tag along and discuss nuances. submitted by /u/Zealousideal_Yak9131 [link] [comments]  ( 9 min )
    [D] What do you all think of these pearls of wisdom on “Doing Great Research”?
    About Jason Wei’s latest tweet. submitted by /u/mildlyphd [link] [comments]  ( 9 min )
    [D] Which is the best physics engine for reinforcement learning??
    What are some of the best physics engines to use for complex reinforcement learning tasks (like humanoid motion)? I came across MuJoCo, PhysX, PyBullet, Isaac etc. but am not sure which to go with. Isaac seems very interesting, but the minimum requirement as per the website is 32 GB of RAM, which is way too much for me (I have 8 GB). MuJoCo is good but the docs are very confusing and hard to get through. What do you believe is the best choice to go with? submitted by /u/rakk109 [link] [comments]  ( 9 min )
    [D] Ensemble of Strong vs Weak Predictors
    This crossed my mind recently and after searching online I couldn't find a concrete answer: would an ensemble composed of strong predictors (let's say training on 1 model of that type had a high metric performance) perform better than an ensemble composed of weak predictors? Bonus: are there any resources that would support your position you can link below? submitted by /u/robml [link] [comments]  ( 9 min )
    [R] Decoupling Features and Classes with Self-Organizing Class Embeddings
    submitted by /u/4rtemi5 [link] [comments]  ( 9 min )
    [P] [D] Hierarchical agent learns all possible policies. Would this implementation work?
    Here's my implementation of an idea I had many years ago: a Sensorimotor Inference Engine. A machine that explores the state space of an environment, learning how to traverse the state space and how to manipulate the environment, and which, when given a goal, can manipulate the environment in accordance with the goal. In other words, it's an agent which learns not one policy, but all possible policies. Doing so, I believe, requires a hierarchy: layers of the same structure which learn broader and broader contexts of the environment. I have recently attempted to design an extremely simple and modularized version of this agent: the Encoder-Predictor-Actor circuit. I need feedback: do you think it would work? If it might work, how might I train the Actor model? I think I know how to train the Encoder and Predictor models, but the Actor model will be harder to train, so if you have any ideas I'd love to hear from you! PS: sorry for the typos in the image text. [Image: a first-pass diagram of the 'simplest' implementation of a sensorimotor inference engine: the encoder-predictor-actor circuit] submitted by /u/Stack3 [link] [comments]  ( 9 min )
    Career suggestions [D]
    Hi there, I need some suggestions from you experts. I am an aerospace engineer (both BSc and MSc), with a university minor in AI. It's pretty clear to me that I should have studied computer science given my passion for this world. In the last 4 years I worked as an engineer in a major aerospace company, and I managed to get back on track with computer science and ML by working as a data scientist and doing ML projects applied to space, while also practicing with LLM agents. My dream is to enter the AGI world, maybe working as an "AI engineer", or working on creating truly "autonomous" systems, maybe leveraging multi-modal models. What do you suggest I should focus on to reach this goal? First getting some "credit" as an ML engineer through courses and certifications, open source projects, or maybe applying right now to some startups in the field? Thanks guys! submitted by /u/cappellino1 [link] [comments]  ( 9 min )
    A[r]xiv Dives - Generating Speech from Text with Fast Speech-2
    We’ve been diving deep into Arxiv Papers as a team on Fridays, hope it’s helpful and feel free to join live if you like the format! submitted by /u/FallMindless3563 [link] [comments]  ( 9 min )
    [R] Eureka: Human-Level Reward Design via Coding Large Language Models
    submitted by /u/MysteryInc152 [link] [comments]  ( 8 min )
    Computer Vision Project Ideas [Project]
    I am taking the computer vision course at my university. We have to do a final project but I am unable to come up with concrete ideas. These are the options: • Select a paper from the computer vision literature, implement and test the approach described in that paper • Take publicly available code, apply it to an interesting novel dataset and explore various extensions and modifications. You may also want to compare two or more systems. Running existing code on the data provided by the authors is not sufficient. • Design and implement a solution to a problem that interests you. This may earn you extra credits. Can anyone please help with what to do? submitted by /u/kxenak [link] [comments]  ( 9 min )
    [D] Can you use a different dataset to run ablation experiments?
    I am working on a computer vision algorithm and I will be benchmarking my method on the MS COCO dataset, like the other methods that have been proposed for the same problem. I want to know if I can use a smaller dataset (COCO minitrain) for my ablation experiments, to demonstrate the efficacy of the different components used in my algorithm and to save time and cost, or will that be a red flag to journal reviewers? submitted by /u/notEVOLVED [link] [comments]  ( 9 min )
    Searching pinecone for relative date information [R]
    I am embedding large medical reports with GPT and upserting them into Pinecone, and would then like to query for chronological results. For example, I upload a report that consists of 10 office visits. I would like to know the date and results of the first visit and then of the last visit. When I embed a query such as "How did the patient describe their pain in the last office visit in the text?", Pinecone doesn't understand the context of 'last', since it is just computing cosine similarity. It pulls pain information but doesn't have a clue which visit comes first. Any help would be greatly appreciated. submitted by /u/Silent_Case_3058 [link] [comments]  ( 9 min )
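    One possible approach, sketched here without any vector-database API: store the visit date as metadata next to each embedded chunk, resolve "first" or "last" to a concrete date outside the embedding space, and only then rank the chunks of that visit by cosine similarity. Everything below (the `visit_date` field, the toy two-dimensional vectors, the function names) is a hypothetical illustration in plain Python, not Pinecone's interface.

```python
import math
from datetime import date

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def search_visit(query_vec, chunks, which="last", top_k=1):
    """Filter chunks to the first/last visit by date, then rank by similarity."""
    dates = sorted({c["visit_date"] for c in chunks})
    target = dates[-1] if which == "last" else dates[0]
    candidates = [c for c in chunks if c["visit_date"] == target]
    candidates.sort(key=lambda c: cosine(query_vec, c["vector"]), reverse=True)
    return candidates[:top_k]

# Toy data: vectors stand in for real embeddings.
chunks = [
    {"visit_date": date(2023, 1, 5), "vector": [1.0, 0.0], "text": "pain described as sharp"},
    {"visit_date": date(2023, 6, 2), "vector": [0.9, 0.1], "text": "pain described as mild"},
    {"visit_date": date(2023, 6, 2), "vector": [0.0, 1.0], "text": "blood pressure normal"},
]
pain_query = [1.0, 0.0]   # stand-in for the embedded question about pain
last_hit = search_visit(pain_query, chunks, which="last")[0]
first_hit = search_visit(pain_query, chunks, which="first")[0]
```

    The design choice is to keep ordering out of the similarity search entirely: dates are compared as dates, and the embedding is only asked the question it can answer, namely which chunk is about pain.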
    [P] Wizard101 Auto-Buyer Script/Bot - Using OCR, OpenCV Python with multiprocessor performance improvements
    submitted by /u/HistorianCrafty3514 [link] [comments]  ( 8 min )
    [P] [D] : RAG on multilevel tabular data
    Hi, has anyone done RAG on multi-level tabular data? If yes, what problems did you face and how did you solve them? My model gives better answers when I convert the data to JSON and then embed it, but I'm looking for a better approach. submitted by /u/Euphoric-Chart1428 [link] [comments]  ( 9 min )
    [D] Is Megabyte's padding the same as streamingLLM?
    I was wondering after reading the recent streamingLLM paper https://arxiv.org/pdf/2309.17453.pdf if the attention sink they use through pre-training and inference is analogous to the learnable padding used in the MEGABYTE architecture https://arxiv.org/pdf/2305.07185.pdf although used for a different purpose? So if I just used MEGABYTE with sliding window attention at inference would it be the same as streamingLLM? submitted by /u/Additional-Ad-7043 [link] [comments]  ( 9 min )
    [D] cloud computing vs personal for ML
    I need a new PC to run NN on. My training sets are about 50GB. Would I be best building my own, or using Google colab pro? Anyone know the specs equivalent to colab Pro? submitted by /u/ajplant [link] [comments]  ( 9 min )
    [D] What is the current SOTA of self-supervised knowledge graph models?
    I want to create a research proposal in this area. Ideally, I would like to work towards self-supervised models that take as input raw (not preprocessed) data of various modalities (text, image, video, audio, ...) and output a knowledge graph of all the data contained within. For example, I could feed it the Wikipedia article about dogs and it spits back all the information contained within, structured in the form of a graph. For people who work in the same general area can you point me to the SOTA models/efforts and research groups that work in this area? And can you also highlight the current challenges to be overcome, if you are deep enough to know? submitted by /u/KlutzyBiz [link] [comments]
    [D] Encoder vs Decoder Transformer for Token Classification
    Hi. I am working on a token classification problem which requires significant language understanding in the base model, and I was wondering: (1) Is there any research that has shown on multiple datasets that encoder-only pretraining tasks produce better results when finetuned for token classification than decoder-only pretraining at the same parameter count? (2) Since a lot of LLM research is focused on text generation, most models are trained on decoder-only pretraining tasks, so what is the largest encoder-only pretrained model that has been trained on >1T tokens? (3) If encoder-only models do indeed produce better results for token classification, is there any empirical rule w.r.t. parameter size for when we can expect decoder-only models to outperform encoder-only models? (E.g., say a 3B decoder-only model is equivalent to a 1B encoder-only model with similar pretraining and finetuning data.) submitted by /u/RemoteSaint [link] [comments]  ( 9 min )
    [D] Need some practical advice on choosing from different CNN model architectures.
    Hi everyone. I would just like to discuss a few things. I've spent about 2 months studying CNNs on coursera from the Deep Learning Specialization. In this time period I learnt the fundamentals and mechanisms of how CNNs work. I also took lectures on a few research papers that studied a few classical CNN models like AlexNet, LeNet-5, VGG-16. And then a few research papers that studied advanced stuff like ResNets, Inception Network, MobileNet, EfficientNet etc. Following that I studied Detection Algorithms, with a primary focus on YOLO Algorithm. I also briefly studied Regional Proposals, Semantic Segmentation, R-CNN, Fast-RCNN, Faster R-CNN, U-Net. I also learnt Face Recognition and Verification Models like Siamese Network using Triplet Loss function and Binary Classification. And also cove…  ( 10 min )
    [D] [P] Web browsing UI-based AI agent: GPT-4V-Act
    Github: GPT-4V-Act (A demo video can be found on the Github) Hi there! I'd like to share with you a project I recently developed. My inspiration came from a recent post about Set-of-Mark visual grounding in GPT-4V. Fascinatingly, my tests showed that GPT-4V, equipped with this capability, could inspect a UI screenshot and provide the precise pixel coordinates needed for steering a mouse/keyboard to perform a specified task. Motivated by this, I built a proof-of-concept web browser embedded with a co-pilot that can "view" the browser and interact with it. Currently, the demo is basic, utilizing web-scraping to morph ChatGPT Plus into an unofficial GPT-4V API at the backend. It lacks some actions and an adblock, resulting in the agent potentially being overloaded by the extensive popups …  ( 10 min )
  • Open

    Grade School & Preteen AI & Data Literacy
    I recently wrote the book “AI & Data Literacy: Empowering Citizens of Data Science” to help non-data scientists – which is most of the world – understand the risks associated with how companies capture and use your personal data to influence your viewing and buying habits… and even your political and societal beliefs. And while… The post Grade School & Preteen AI & Data Literacy appeared first on Data Science Central.  ( 22 min )
  • Open

    Is there any neural network or LLM-style tool like ChatGPT or Midjourney that can help us train and generate custom sounds?
    Generating a Wide Variety of Sounds: I'm a non-technical person with very little knowledge of developing AI tools, and I intend to learn Python. My question is as follows: are there tools or ChatGPT-like platforms that can help people like me generate sounds such as dog barks and cat meows? I want either something that can generate a variety of sounds, or to work towards making something that can help me generate audio like dog barks (fierce, aggressive ones, for example), not limited to dog barks but also covering nature, other animals, vehicles and machinery (e.g., honks, engine sounds), and possibly human sounds (though that's not my primary focus for now). The Amount of Technical Assistance Needed: I also came across a tool called Teachable Machine and was wondering if it could be a solution, as it does offer tools for audio. I am also aware that I would need datasets for such a task, but apart from that I am not too sure about the intricacies involved or the knowledge needed, and I assume it is likely not very easy. https://www.youtube.com/watch?v=L4GOmYPPqn8&t=1854s [Teachable Machine](https://teachablemachine.withgoogle.com/) Inspiration: I was inspired by a project I found here: [https://x.com/TheAIAnonGuy/status/1684443155448360961?s=20] Can anyone provide insights, guidance, or recommendations on how to accomplish this? To be fair, I'm not really sure if this is an audio-related or a neural network/machine learning (ML)/deep learning question. But I would like more insight into whether this is possible on an individual scale, either with Teachable Machine, code, AI, or a combination of all approaches, and whether there are any beginner-friendly ways to achieve this. Thank you all for your assistance! submitted by /u/Beginning_Finding_98 [link] [comments]
  • Open

    A Unified Approach to Domain Incremental Learning with Memory: Theory and Algorithm. (arXiv:2310.12244v1 [cs.LG])
    Domain incremental learning aims to adapt to a sequence of domains with access to only a small subset of data (i.e., memory) from previous domains. Various methods have been proposed for this problem, but it is still unclear how they are related and when practitioners should choose one method over another. In response, we propose a unified framework, dubbed Unified Domain Incremental Learning (UDIL), for domain incremental learning with memory. Our UDIL **unifies** various existing methods, and our theoretical analysis shows that UDIL always achieves a tighter generalization error bound compared to these methods. The key insight is that different existing methods correspond to our bound with different **fixed** coefficients; based on insights from this unification, our UDIL allows **adaptive** coefficients during training, thereby always achieving the tightest bound. Empirical results show that our UDIL outperforms the state-of-the-art domain incremental learning methods on both synthetic and real-world datasets. Code will be available at https://github.com/Wang-ML-Lab/unified-continual-learning.  ( 2 min )
    Cooperative Minibatching in Graph Neural Networks. (arXiv:2310.12403v1 [cs.LG])
    Significant computational resources are required to train Graph Neural Networks (GNNs) at a large scale, and the process is highly data-intensive. One of the most effective ways to reduce resource requirements is minibatch training coupled with graph sampling. GNNs have the unique property that items in a minibatch have overlapping data. However, the commonly implemented Independent Minibatching approach assigns each Processing Element (PE) its own minibatch to process, leading to duplicated computations and input data access across PEs. This amplifies the Neighborhood Explosion Phenomenon (NEP), which is the main bottleneck limiting scaling. To reduce the effects of NEP in the multi-PE setting, we propose a new approach called Cooperative Minibatching. Our approach capitalizes on the fact that the size of the sampled subgraph is a concave function of the batch size, leading to significant reductions in the amount of work per seed vertex as batch sizes increase. Hence, it is favorable for processors equipped with a fast interconnect to work on a large minibatch together as a single larger processor, instead of working on separate smaller minibatches, even though global batch size is identical. We also show how to take advantage of the same phenomenon in serial execution by generating dependent consecutive minibatches. Our experimental evaluations show up to 4x bandwidth savings for fetching vertex embeddings, by simply increasing this dependency without harming model convergence. Combining our proposed approaches, we achieve up to 64% speedup over Independent Minibatching on single-node multi-GPU systems.  ( 3 min )
    Efficient Long-Range Transformers: You Need to Attend More, but Not Necessarily at Every Layer. (arXiv:2310.12442v1 [cs.CL])
    Pretrained transformer models have demonstrated remarkable performance across various natural language processing tasks. These models leverage the attention mechanism to capture long- and short-range dependencies in the sequence. However, the (full) attention mechanism incurs high computational cost - quadratic in the sequence length, which is not affordable in tasks with long sequences, e.g., inputs with 8k tokens. Although sparse attention can be used to improve computational efficiency, as suggested in existing work, it has limited modeling capacity and often fails to capture complicated dependencies in long sequences. To tackle this challenge, we propose MASFormer, an easy-to-implement transformer variant with Mixed Attention Spans. Specifically, MASFormer is equipped with full attention to capture long-range dependencies, but only at a small number of layers. For the remaining layers, MASFormer only employs sparse attention to capture short-range dependencies. Our experiments on natural language modeling and generation tasks show that a decoder-only MASFormer model of 1.3B parameters can achieve competitive performance to vanilla transformers with full attention while significantly reducing computational cost (up to 75%). Additionally, we investigate the effectiveness of continual training with long sequence data and how sequence length impacts downstream generation performance, which may be of independent interest.  ( 2 min )
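    As a rough illustration of the mixed-attention-span idea in the abstract above, the sketch below builds one causal attention mask per layer: full attention at a few chosen layers and a local window everywhere else. This is not the authors' implementation; the sliding window is used here as one stand-in for "sparse attention", and the layer indices, window size, and function names are assumptions.

```python
import numpy as np

def causal_mask(seq_len, window=None):
    """Boolean (seq_len, seq_len) mask; True where query i may attend to key j.

    window=None gives full causal attention; otherwise each query only sees
    itself and the previous window-1 tokens (sliding-window attention).
    """
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    mask = j <= i                      # causal: no attention to future tokens
    if window is not None:
        mask &= (i - j) < window       # local: restrict to the most recent tokens
    return mask

def layer_masks(n_layers, seq_len, full_layers, window):
    """Full attention at the layers in full_layers, windowed attention elsewhere."""
    return [
        causal_mask(seq_len, None if layer in full_layers else window)
        for layer in range(n_layers)
    ]

# Hypothetical configuration: 4 layers, only the last one uses full attention.
masks = layer_masks(n_layers=4, seq_len=8, full_layers={3}, window=2)
attended_pairs = [int(m.sum()) for m in masks]
```

    Counting the `True` entries per layer gives a feel for the savings: the windowed layers grow linearly with sequence length while the full-attention layer grows quadratically, which is why limiting full attention to a few layers reduces cost.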
    How a student becomes a teacher: learning and forgetting through Spectral methods. (arXiv:2310.12612v1 [cs.LG])
    In theoretical ML, the teacher-student paradigm is often employed as an effective metaphor for real-life tuition. The above scheme proves particularly relevant when the student network is overparameterized as compared to the teacher network. Under these operating conditions, it is tempting to speculate that the student ability to handle the given task could be eventually stored in a sub-portion of the whole network. This latter should be to some extent reminiscent of the frozen teacher structure, according to suitable metrics, while being approximately invariant across different architectures of the student candidate network. Unfortunately, state-of-the-art conventional learning techniques could not help in identifying the existence of such an invariant subnetwork, due to the inherent degree of non-convexity that characterizes the examined problem. In this work, we take a leap forward by proposing a radically different optimization scheme which builds on a spectral representation of the linear transfer of information between layers. The gradient is hence calculated with respect to both eigenvalues and eigenvectors with negligible increase in terms of computational and complexity load, as compared to standard training algorithms. Working in this framework, we could isolate a stable student substructure, that mirrors the true complexity of the teacher in terms of computing neurons, path distribution and topological attributes. When pruning unimportant nodes of the trained student, as follows a ranking that reflects the optimized eigenvalues, no degradation in the recorded performance is seen above a threshold that corresponds to the effective teacher size. The observed behavior can be pictured as a genuine second-order phase transition that bears universality traits.  ( 3 min )
    An Image is Worth Multiple Words: Learning Object Level Concepts using Multi-Concept Prompt Learning. (arXiv:2310.12274v1 [cs.CV])
    Textual Inversion, a prompt learning method, learns a singular embedding for a new "word" to represent image style and appearance, allowing it to be integrated into natural language sentences to generate novel synthesised images. However, identifying and integrating multiple object-level concepts within one scene poses significant challenges even when embeddings for individual concepts are attainable. This is further confirmed by our empirical tests. To address this challenge, we introduce a framework for Multi-Concept Prompt Learning (MCPL), where multiple new "words" are simultaneously learned from a single sentence-image pair. To enhance the accuracy of word-concept correlation, we propose three regularisation techniques: Attention Masking (AttnMask) to concentrate learning on relevant areas; Prompts Contrastive Loss (PromptCL) to separate the embeddings of different concepts; and Bind adjective (Bind adj.) to associate new "words" with known words. We evaluate via image generation, editing, and attention visualisation with diverse images. Extensive quantitative comparisons demonstrate that our method can learn more semantically disentangled concepts with enhanced word-concept correlation. Additionally, we introduce a novel dataset and evaluation protocol tailored for this new task of learning object-level concepts.  ( 2 min )
    No-Regret Learning in Bilateral Trade via Global Budget Balance. (arXiv:2310.12370v1 [cs.GT])
    Bilateral trade revolves around the challenge of facilitating transactions between two strategic agents -- a seller and a buyer -- both of whom have private valuations for the item. We study the online version of the problem, in which at each time step a new seller and buyer arrive. The learner's task is to set a price for each agent, without any knowledge about their valuations. The sequence of sellers and buyers is chosen by an oblivious adversary. In this setting, known negative results rule out the possibility of designing algorithms with sublinear regret when the learner has to guarantee budget balance for each iteration. In this paper, we introduce the notion of global budget balance, which requires the agent to be budget balanced only over the entire time horizon. By requiring global budget balance, we provide the first no-regret algorithms for bilateral trade with adversarial inputs under various feedback models. First, we show that in the full-feedback model the learner can guarantee $\tilde{O}(\sqrt{T})$ regret against the best fixed prices in hindsight, which is order-wise optimal. Then, in the case of partial feedback models, we provide an algorithm guaranteeing a $\tilde{O}(T^{3/4})$ regret upper bound with one-bit feedback, which we complement with a nearly-matching lower bound. Finally, we investigate how these results vary when measuring regret using an alternative benchmark.  ( 2 min )
    Automated Repair of Declarative Software Specifications in the Era of Large Language Models. (arXiv:2310.12425v1 [cs.SE])
    The growing adoption of declarative software specification languages, coupled with their inherent difficulty in debugging, has underscored the need for effective and automated repair techniques applicable to such languages. Researchers have recently explored various methods to automatically repair declarative software specifications, such as template-based repair, feedback-driven iterative repair, and bounded exhaustive approaches. The latest developments in large language models provide new opportunities for the automatic repair of declarative specifications. In this study, we assess the effectiveness of utilizing OpenAI's ChatGPT to repair software specifications written in the Alloy declarative language. Unlike imperative languages, specifications in Alloy are not executed but rather translated into logical formulas and evaluated using backend constraint solvers to identify specification instances and counterexamples to assertions. Our evaluation focuses on ChatGPT's ability to improve the correctness and completeness of Alloy declarative specifications through automatic repairs. We analyze the results produced by ChatGPT and compare them with those of leading automatic Alloy repair methods. Our study revealed that while ChatGPT falls short in comparison to existing techniques, it was able to successfully repair bugs that no other technique could address. Our analysis also identified errors in ChatGPT's generated repairs, including improper operator usage, type errors, higher-order logic misuse, and relational arity mismatches. Additionally, we observed instances of hallucinations in ChatGPT-generated repairs and inconsistency in its results. Our study provides valuable insights for software practitioners, researchers, and tool builders considering ChatGPT for declarative specification repairs.  ( 3 min )
    Classification-Aided Robust Multiple Target Tracking Using Neural Enhanced Message Passing. (arXiv:2310.12407v1 [cs.LG])
    We address the challenge of tracking an unknown number of targets in strong clutter environments using measurements from a radar sensor. Leveraging the range-Doppler spectra information, we identify the measurement classes, which serve as additional information to enhance clutter rejection and data association, thus bolstering the robustness of target tracking. We first introduce a novel neural enhanced message passing approach, where the beliefs obtained by the unified message passing are fed into the neural network as additional information. The output beliefs are then utilized to refine the original beliefs. Then, we propose a classification-aided robust multiple target tracking algorithm, employing the neural enhanced message passing technique. This algorithm is comprised of three modules: a message-passing module, a neural network module, and a Dempster-Shafer module. The message-passing module is used to represent the statistical model by the factor graph and infers target kinematic states, visibility states, and data associations based on the spatial measurement information. The neural network module is employed to extract features from range-Doppler spectra and derive beliefs on whether a measurement is target-generated or clutter-generated. The Dempster-Shafer module is used to fuse the beliefs obtained from both the factor graph and the neural network. As a result, our proposed algorithm adopts a model-and-data-driven framework, effectively enhancing clutter suppression and data association, leading to significant improvements in multiple target tracking performance. We validate the effectiveness of our approach using both simulated and real data scenarios, demonstrating its capability to handle challenging tracking scenarios in practical radar applications.  ( 3 min )
    Safe RLHF: Safe Reinforcement Learning from Human Feedback. (arXiv:2310.12773v1 [cs.AI])
    With the development of large language models (LLMs), striking a balance between the performance and safety of AI systems has never been more critical. However, the inherent tension between the objectives of helpfulness and harmlessness presents a significant challenge during LLM training. To address this issue, we propose Safe Reinforcement Learning from Human Feedback (Safe RLHF), a novel algorithm for human value alignment. Safe RLHF explicitly decouples human preferences regarding helpfulness and harmlessness, effectively avoiding the crowdworkers' confusion about the tension and allowing us to train separate reward and cost models. We formalize the safety concern of LLMs as an optimization task of maximizing the reward function while satisfying specified cost constraints. Leveraging the Lagrangian method to solve this constrained problem, Safe RLHF dynamically adjusts the balance between the two objectives during fine-tuning. Through a three-round fine-tuning using Safe RLHF, we demonstrate a superior ability to mitigate harmful responses while enhancing model performance compared to existing value-aligned algorithms. Experimentally, we fine-tuned the Alpaca-7B using Safe RLHF and aligned it with collected human preferences, significantly improving its helpfulness and harmlessness according to human evaluations.  ( 2 min )
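The Lagrangian idea in the abstract above (maximize a reward subject to a cost constraint, with the multiplier adjusting the balance during optimization) can be illustrated on a toy scalar problem. This is only a sketch under stated assumptions: the reward `-(theta - 2)^2`, the cost `theta - 1 <= 0`, the step size, and the function name `solve_toy_constrained` are all hypothetical and unrelated to the paper's actual reward and cost models or its fine-tuning procedure.

```python
def solve_toy_constrained(steps=5000, lr=0.05):
    """Primal-dual gradient iteration on a toy constrained problem.

    Hypothetical toy setup (not from the paper):
      maximize  reward(theta) = -(theta - 2)^2
      subject to cost(theta)  = theta - 1 <= 0
    Lagrangian: L(theta, lam) = reward(theta) - lam * cost(theta), lam >= 0.
    """
    theta, lam = 0.0, 0.0
    for _ in range(steps):
        # Gradient ascent on theta: dL/dtheta = -2 * (theta - 2) - lam
        theta += lr * (-2.0 * (theta - 2.0) - lam)
        # Dual update: lam grows while the constraint is violated,
        # shrinks otherwise, and is projected back onto lam >= 0.
        lam = max(0.0, lam + lr * (theta - 1.0))
    return theta, lam


theta, lam = solve_toy_constrained()
print(f"theta={theta:.4f}, lambda={lam:.4f}")
```

In this toy problem the unconstrained optimum (`theta = 2`) violates the constraint, so the multiplier rises until the iterate settles on the constraint boundary.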
    Fast Parameter Inference on Pulsar Timing Arrays with Normalizing Flows. (arXiv:2310.12209v1 [astro-ph.IM])
    Pulsar timing arrays (PTAs) perform Bayesian posterior inference with expensive MCMC methods. Given a dataset of ~10-100 pulsars and O(10^3) timing residuals each, producing a posterior distribution for the stochastic gravitational wave background (SGWB) can take days to a week. The computational bottleneck arises because the likelihood evaluation required for MCMC is extremely costly when considering the dimensionality of the search space. Fortunately, generating simulated data is fast, so modern simulation-based inference techniques can be brought to bear on the problem. In this paper, we demonstrate how conditional normalizing flows trained on simulated data can be used for extremely fast and accurate estimation of the SGWB posteriors, reducing the sampling time from weeks to a matter of seconds.  ( 2 min )
    MuseGNN: Interpretable and Convergent Graph Neural Network Layers at Scale. (arXiv:2310.12457v1 [cs.LG])
    Among the many variants of graph neural network (GNN) architectures capable of modeling data with cross-instance relations, an important subclass involves layers designed such that the forward pass iteratively reduces a graph-regularized energy function of interest. In this way, node embeddings produced at the output layer dually serve as both predictive features for solving downstream tasks (e.g., node classification) and energy function minimizers that inherit desirable inductive biases and interpretability. However, scaling GNN architectures constructed in this way remains challenging, in part because the convergence of the forward pass may involve models with considerable depth. To tackle this limitation, we propose a sampling-based energy function and scalable GNN layers that iteratively reduce it, guided by convergence guarantees in certain settings. We also instantiate a full GNN architecture based on these designs, and the model achieves competitive accuracy and scalability when applied to the largest publicly-available node classification benchmark exceeding 1TB in size.  ( 2 min )
    Closed-Form Diffusion Models. (arXiv:2310.12395v1 [cs.LG])
    Score-based generative models (SGMs) sample from a target distribution by iteratively transforming noise using the score function of the perturbed target. For any finite training set, this score function can be evaluated in closed form, but the resulting SGM memorizes its training data and does not generate novel samples. In practice, one approximates the score by training a neural network via score-matching. The error in this approximation promotes generalization, but neural SGMs are costly to train and sample, and the effective regularization this error provides is not well-understood theoretically. In this work, we instead explicitly smooth the closed-form score to obtain an SGM that generates novel samples without training. We analyze our model and propose an efficient nearest-neighbor-based estimator of its score function. Using this estimator, our method achieves sampling times competitive with neural SGMs while running on consumer-grade CPUs.  ( 2 min )
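The claim that the score of a perturbed finite training set is available in closed form can be made concrete. The sketch below assumes an isotropic Gaussian perturbation with standard deviation `sigma`, so the perturbed density is a Gaussian mixture centred on the training points. The function name `closed_form_score` is hypothetical, and the code shows only the unsmoothed closed-form score, not the smoothing or the nearest-neighbor estimator proposed in the paper.

```python
import math


def closed_form_score(x, train, sigma):
    """Score (gradient of log-density) of a Gaussian-perturbed empirical distribution.

    Assumes p_sigma(x) = (1/N) * sum_i Normal(x; x_i, sigma^2 I).
    Then grad log p_sigma(x) = (sum_i w_i(x) * x_i - x) / sigma^2,
    where w_i(x) is a softmax over -||x - x_i||^2 / (2 sigma^2).
    """
    sq_dists = [sum((a - b) ** 2 for a, b in zip(x, xi)) for xi in train]
    logits = [-d / (2.0 * sigma ** 2) for d in sq_dists]
    # Subtract the max logit before exponentiating for numerical stability.
    top = max(logits)
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    weights = [w / total for w in weights]
    # Softmax-weighted mean of the training points.
    mean = [sum(w * xi[k] for w, xi in zip(weights, train)) for k in range(len(x))]
    return [(m - xk) / sigma ** 2 for m, xk in zip(mean, x)]


print(closed_form_score([0.0, 0.0], [[1.0, 2.0], [-1.0, 0.5]], sigma=1.0))
```

As `sigma` shrinks, the softmax weights concentrate on the nearest training point, which is one way to see why following this exact score reproduces training samples.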
    Exploring Graph Neural Networks for Indian Legal Judgment Prediction. (arXiv:2310.12800v1 [cs.LG])
    The burdensome impact of a skewed judges-to-cases ratio on the judicial system manifests in an overwhelming backlog of pending cases alongside an ongoing influx of new ones. To tackle this issue and expedite the judicial process, the proposition of an automated system capable of suggesting case outcomes based on factual evidence and precedent from past cases gains significance. This research paper centres on developing a graph neural network-based model to address the Legal Judgment Prediction (LJP) problem, recognizing the intrinsic graph structure of judicial cases and making it a binary node classification problem. We explored various embeddings as model features, while nodes such as time nodes and judicial acts were added and pruned to evaluate the model's performance. The study is done while considering the ethical dimension of fairness in these predictions, considering gender and name biases. A link prediction task is also conducted to assess the model's proficiency in anticipating connections between two specified nodes. By harnessing the capabilities of graph neural networks and incorporating fairness analyses, this research aims to contribute insights towards streamlining the adjudication process, enhancing judicial efficiency, and fostering a more equitable legal landscape, ultimately alleviating the strain imposed by mounting case backlogs. Our best-performing model with XLNet pre-trained embeddings as its features achieves a macro F1 score of 75% for the LJP task. For link prediction, the same set of features performs best, giving an ROC of more than 80%.
    Generative modeling, design and analysis of spider silk protein sequences for enhanced mechanical properties. (arXiv:2309.10170v1 [cond-mat.mtrl-sci] CROSS LISTED)
    Spider silks are remarkable materials characterized by superb mechanical properties such as strength, extensibility and lightweightedness. Yet, to date, limited models are available to fully explore sequence-property relationships for analysis and design. Here we propose a custom generative large-language model to enable design of novel spider silk protein sequences to meet complex combinations of target mechanical properties. The model, pretrained on a large set of protein sequences, is fine-tuned on ~1,000 major ampullate spidroin (MaSp) sequences for which associated fiber-level mechanical properties exist, to yield an end-to-end forward and inverse generative strategy. Performance is assessed through: (1) a novelty analysis and protein type classification for generated spidroin sequences through BLAST searches, (2) property evaluation and comparison with similar sequences, (3) comparison of molecular structures, and (4) a detailed sequence motif analysis. We generate silk sequences with property combinations that do not exist in nature, and develop a deep understanding of the mechanistic roles of sequence patterns in achieving overarching key mechanical properties (elastic modulus, strength, toughness, failure strain). The model provides an efficient approach to expand the silkome dataset, facilitating further sequence-structure analyses of silks, and establishes a foundation for synthetic silk design and optimization.
    TabuLa: Harnessing Language Models for Tabular Data Synthesis. (arXiv:2310.12746v1 [cs.LG])
    Given the ubiquitous use of tabular data in industries and the growing concerns in data privacy and security, tabular data synthesis emerges as a critical research area. The recent state-of-the-art methods show that large language models (LLMs) can be adopted to generate realistic tabular data. As LLMs pre-process tabular data as full text, they have the advantage of avoiding the curse of dimensionality associated with one-hot encoding high-dimensional data. However, their long training time and limited re-usability on new tasks prevent them from replacing existing tabular generative models. In this paper, we propose Tabula, a tabular data synthesizer based on the language model structure. Through Tabula, we demonstrate the inherent limitation of employing pre-trained language models designed for natural language processing (NLP) in the context of tabular data synthesis. Our investigation delves into the development of a dedicated foundational model tailored specifically for tabular data synthesis. Additionally, we propose a token sequence compression strategy to significantly reduce training time while preserving the quality of synthetic data. Extensive experiments on six datasets demonstrate that using a language model structure without loading the well-trained model weights yields a better starting model for tabular data synthesis. Moreover, the Tabula model, previously trained on other tabular data, serves as an excellent foundation model for new tabular data synthesis tasks. Additionally, the token sequence compression method substantially reduces the model's training time. Results show that Tabula reduces training time per epoch by 46.2% on average compared to the current LLM-based state-of-the-art algorithm and consistently achieves even higher synthetic data utility.
    The Power of Populations in Decentralized Learning Dynamics. (arXiv:2306.08670v2 [cs.LG] UPDATED)
    We study a distributed multi-armed bandit setting among a population of $n$ memory-constrained nodes in the gossip model: at each round, every node locally adopts one of $m$ arms, observes a reward drawn from the arm's (adversarially chosen) distribution, and then communicates with a randomly sampled neighbor, exchanging information to determine its policy in the next round. We introduce and analyze several families of dynamics for this task that are decentralized: each node's decision is entirely local and depends only on its most recently obtained reward and that of the neighbor it sampled. We show a connection between the global evolution of these decentralized dynamics and a certain class of "zero-sum" multiplicative weights update algorithms, and we develop a general framework for analyzing the population-level regret of these natural protocols. Using this framework, we derive sublinear regret bounds under a wide range of parameter regimes (i.e., the size of the population and number of arms) for both the stationary reward setting (where the mean of each arm's distribution is fixed over time) and the adversarial reward setting (where means can vary over time). Further, we show that these protocols can approximately optimize convex functions over the simplex when the reward distributions are generated from a stochastic gradient oracle.
    Adaptive Pairwise Encodings for Link Prediction. (arXiv:2310.11009v2 [cs.LG] UPDATED)
    Link prediction is a common task on graph-structured data that has seen applications in a variety of domains. Classically, hand-crafted heuristics were used for this task. Heuristic measures are chosen such that they correlate well with the underlying factors related to link formation. In recent years, a new class of methods has emerged that combines the advantages of message-passing neural networks (MPNN) and heuristic methods. These methods perform predictions by using the output of an MPNN in conjunction with a "pairwise encoding" that captures the relationship between nodes in the candidate link. They have been shown to achieve strong performance on numerous datasets. However, current pairwise encodings often contain a strong inductive bias, using the same underlying factors to classify all links. This limits the ability of existing methods to learn how to properly classify a variety of different links that may form from different factors. To address this limitation, we propose a new method, LPFormer, which attempts to adaptively learn the pairwise encodings for each link. LPFormer models the link factors via an attention module that learns the pairwise encoding that exists between nodes by modeling multiple factors integral to link prediction. Extensive experiments demonstrate that LPFormer can achieve SOTA performance on numerous datasets while maintaining efficiency.
    Schema First! Learn Versatile Knowledge Graph Embeddings by Capturing Semantics with MASCHInE. (arXiv:2306.03659v2 [cs.AI] UPDATED)
    Knowledge graph embedding models (KGEMs) have gained considerable traction in recent years. These models learn a vector representation of knowledge graph entities and relations, a.k.a. knowledge graph embeddings (KGEs). Learning versatile KGEs is desirable as it makes them useful for a broad range of tasks. However, KGEMs are usually trained for a specific task, which makes their embeddings task-dependent. In parallel, the widespread assumption that KGEMs actually create a semantic representation of the underlying entities and relations (e.g., project similar entities closer than dissimilar ones) has been challenged. In this work, we design heuristics for generating protographs -- small, modified versions of a KG that leverage RDF/S information. The learnt protograph-based embeddings are meant to encapsulate the semantics of a KG, and can be leveraged in learning KGEs that, in turn, also better capture semantics. Extensive experiments on various evaluation benchmarks demonstrate the soundness of this approach, which we call Modular and Agnostic SCHema-based Integration of protograph Embeddings (MASCHInE). In particular, MASCHInE helps produce more versatile KGEs that yield substantially better performance for entity clustering and node classification tasks. For link prediction, using MASCHInE substantially increases the number of semantically valid predictions with equivalent rank-based performance.
    Machine Learning Based Compensation for Inconsistencies in Knitted Force Sensors. (arXiv:2306.12129v2 [eess.SY] UPDATED)
    Knitted sensors frequently suffer from inconsistencies due to innate effects such as offset, relaxation, and drift. These properties, in combination, make it challenging to reliably map from sensor data to physical actuation. In this paper, we demonstrate a method for counteracting this by applying processing using a minimal artificial neural network (ANN) in combination with straightforward pre-processing. We apply a number of exponential smoothing filters on a re-sampled sensor signal, to produce features that preserve different levels of historical sensor data and, in combination, represent an adequate state of previous sensor actuation. By training a three-layer ANN with a total of 8 neurons, we manage to significantly improve the mapping between sensor reading and actuation force. Our findings also show that our technique translates to sensors of reasonably different composition in terms of material and structure, and it can furthermore be applied to related physical features such as strain.
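The feature construction described above (several exponential smoothing filters over the sensor signal, each retaining a different amount of history) can be sketched as follows. The smoothing factors in `alphas` and the function name `smoothing_features` are hypothetical; the abstract does not state how many filters are used, which factors they take, or how the signal is re-sampled.

```python
def smoothing_features(signal, alphas):
    """Run one exponential smoothing filter per alpha over the signal.

    Each filter follows s_t = alpha * x_t + (1 - alpha) * s_{t-1}.
    A small alpha keeps a long memory of past readings; alpha = 1 keeps none.
    Returns one feature vector (one value per filter) for every sample.
    """
    if not signal:
        return []
    # Initialise every filter state with the first sample.
    states = [signal[0]] * len(alphas)
    features = []
    for sample in signal:
        states = [a * sample + (1.0 - a) * s for a, s in zip(alphas, states)]
        features.append(list(states))
    return features


# Hypothetical smoothing factors spanning short to long memory.
rows = smoothing_features([0.0, 1.0, 1.0, 1.0], alphas=[1.0, 0.5, 0.1])
for row in rows:
    print(row)
```

Each row could then serve as the input vector to a small network, so that the network sees the current reading together with summaries of the recent and more distant past.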
    SNIP: Bridging Mathematical Symbolic and Numeric Realms with Unified Pre-training. (arXiv:2310.02227v2 [cs.LG] UPDATED)
    In an era where symbolic mathematical equations are indispensable for modeling complex natural phenomena, scientific inquiry often involves collecting observations and translating them into mathematical expressions. Recently, deep learning has emerged as a powerful tool for extracting insights from data. However, existing models typically specialize in either numeric or symbolic domains, and are usually trained in a supervised manner tailored to specific tasks. This approach neglects the substantial benefits that could arise from a task-agnostic unified understanding between symbolic equations and their numeric counterparts. To bridge the gap, we introduce SNIP, a Symbolic-Numeric Integrated Pre-training, which employs joint contrastive learning between symbolic and numeric domains, enhancing their mutual similarities in the pre-trained embeddings. By performing latent space analysis, we observe that SNIP provides cross-domain insights into the representations, revealing that symbolic supervision enhances the embeddings of numeric data and vice versa. We evaluate SNIP across diverse tasks, including symbolic-to-numeric mathematical property prediction and numeric-to-symbolic equation discovery, commonly known as symbolic regression. Results show that SNIP effectively transfers to various tasks, consistently outperforming fully supervised baselines and competing strongly with established task-specific methods, especially in few-shot learning scenarios where available data is limited.
    Parallel Bayesian Optimization Using Satisficing Thompson Sampling for Time-Sensitive Black-Box Optimization. (arXiv:2310.12526v1 [cs.LG])
    Bayesian optimization (BO) is widely used for black-box optimization problems, and has been shown to perform well in various real-world tasks. However, most of the existing BO methods aim to learn the optimal solution, which may become infeasible when the parameter space is extremely large or the problem is time-sensitive. In these contexts, switching to a satisficing solution that requires less information can result in better performance. In this work, we focus on time-sensitive black-box optimization problems and propose satisficing Thompson sampling-based parallel Bayesian optimization (STS-PBO) approaches, including synchronous and asynchronous versions. We shift the target from an optimal solution to a satisficing solution that is easier to learn. The rate-distortion theory is introduced to construct a loss function that balances the amount of information that needs to be learned with sub-optimality, and the Blahut-Arimoto algorithm is adopted to compute the target solution that reaches the minimum information rate under the distortion limit at each step. Both discounted and undiscounted Bayesian cumulative regret bounds are theoretically derived for the proposed STS-PBO approaches. The effectiveness of the proposed methods is demonstrated on a fast-charging design problem of Lithium-ion batteries. The results are consistent with theoretical analyses, and show that our STS-PBO methods outperform both sequential counterparts and parallel BO with traditional Thompson sampling in both synchronous and asynchronous settings.
    Constrained Reweighting of Distributions: an Optimal Transport Approach. (arXiv:2310.12447v1 [stat.ML])
    We commonly encounter the problem of identifying an optimally weight adjusted version of the empirical distribution of observed data, adhering to predefined constraints on the weights. Such constraints often manifest as restrictions on the moments, tail behaviour, shapes, number of modes, etc., of the resulting weight adjusted empirical distribution. In this article, we substantially enhance the flexibility of such methodology by introducing nonparametrically imbued distributional constraints on the weights, and developing a general framework leveraging the maximum entropy principle and tools from optimal transport. The key idea is to ensure that the maximum entropy weight adjusted empirical distribution of the observed data is close to a pre-specified probability distribution in terms of the optimal transport metric while allowing for subtle departures. The versatility of the framework is demonstrated in the context of three disparate applications where data re-weighting is warranted to satisfy side constraints on the optimization problem at the heart of the statistical task: namely, portfolio allocation, semi-parametric inference for complex surveys, and ensuring algorithmic fairness in machine learning algorithms.
    Transformers for scientific data: a pedagogical review for astronomers. (arXiv:2310.12069v2 [astro-ph.IM] UPDATED)
    The deep learning architecture associated with ChatGPT and related generative AI products is known as transformers. Initially applied to Natural Language Processing, transformers and the self-attention mechanism they exploit have gained widespread interest across the natural sciences. The goal of this pedagogical and informal review is to introduce transformers to scientists. The review includes the mathematics underlying the attention mechanism, a description of the original transformer architecture, and a section on applications to time series and imaging data in astronomy. We include a Frequently Asked Questions section for readers who are curious about generative AI or interested in getting started with transformers for their research problem.
    Protein 3D Graph Structure Learning for Robust Structure-based Protein Property Prediction. (arXiv:2310.11466v2 [cs.LG] UPDATED)
    Protein structure-based property prediction has emerged as a promising approach for various biological tasks, such as protein function prediction and sub-cellular location estimation. The existing methods rely heavily on experimental protein structure data and fail in scenarios where these data are unavailable. Predicted protein structures from AI tools (e.g., AlphaFold2) were utilized as alternatives. However, we observed that current practices, which simply employ accurately predicted structures during inference, suffer from notable degradation in prediction accuracy. While similar phenomena have been extensively studied in general fields (e.g., Computer Vision) as model robustness, their impact on protein property prediction remains unexplored. In this paper, we first investigate the reason behind the performance decrease when utilizing predicted structures, attributing it to the structure embedding bias from the perspective of structure representation learning. To study this problem, we identify a Protein 3D Graph Structure Learning Problem for Robust Protein Property Prediction (PGSL-RP3), collect benchmark datasets, and present a protein Structure embedding Alignment Optimization framework (SAO) to mitigate the problem of structure embedding bias between the predicted and experimental protein structures. Extensive experiments have shown that our framework is model-agnostic and effective in improving the property prediction of both predicted structures and experimental structures. The benchmark datasets and codes will be released to benefit the community.
    Efficient Dataset Distillation through Alignment with Smooth and High-Quality Expert Trajectories. (arXiv:2310.10541v1 [cs.CV] CROSS LISTED)
    Training a large and state-of-the-art machine learning model typically necessitates the use of large-scale datasets, which, in turn, makes the training and parameter-tuning process expensive and time-consuming. Some researchers opt to distil information from real-world datasets into tiny and compact synthetic datasets while maintaining their ability to train a well-performing model, hence proposing a data-efficient method known as Dataset Distillation (DD). Despite recent progress in this field, existing methods still underperform and cannot effectively replace large datasets. In this paper, unlike previous methods that focus solely on improving the efficacy of student distillation, we are the first to recognize the important interplay between expert and student. We argue the significant impact of expert smoothness when employing more potent expert trajectories in subsequent dataset distillation. Based on this, we introduce the integration of clipping loss and gradient penalty to regulate the rate of parameter changes in expert trajectories. Furthermore, in response to the sensitivity exhibited towards randomly initialized variables during distillation, we propose representative initialization for synthetic dataset and balanced inner-loop loss. Finally, we present two enhancement strategies, namely intermediate matching loss and weight perturbation, to mitigate the potential occurrence of cumulative errors. We conduct extensive experiments on datasets of different scales, sizes, and resolutions. The results demonstrate that the proposed method significantly outperforms prior methods.
    The Kernel Density Integral Transformation. (arXiv:2309.10194v2 [stat.ML] UPDATED)
    Feature preprocessing continues to play a critical role when applying machine learning and statistical methods to tabular data. In this paper, we propose the use of the kernel density integral transformation as a feature preprocessing step. Our approach subsumes the two leading feature preprocessing methods as limiting cases: linear min-max scaling and quantile transformation. We demonstrate that, without hyperparameter tuning, the kernel density integral transformation can be used as a simple drop-in replacement for either method, offering protection from the weaknesses of each. Alternatively, with tuning of a single continuous hyperparameter, we frequently outperform both of these methods. Finally, we show that the kernel density transformation can be profitably applied to statistical data analysis, particularly in correlation analysis and univariate clustering.
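The two limiting cases mentioned above can be seen in a small sketch that integrates a kernel density estimate to obtain a smooth CDF. The Gaussian kernel, the rescaling that maps the sample minimum to 0 and the maximum to 1, the absolute `bandwidth` parameter, and the function name `kd_integral_transform` are assumptions made for this illustration and may differ from the paper's exact definition.

```python
import math


def kd_integral_transform(values, bandwidth):
    """Map each value through the CDF of a Gaussian kernel density estimate.

    The KDE CDF is the average of Gaussian CDFs centred on the data points.
    The result is rescaled so the sample minimum maps to 0 and the maximum to 1
    (an assumption of this sketch).
    """
    def kde_cdf(x):
        total = 0.0
        for v in values:
            z = (x - v) / (bandwidth * math.sqrt(2.0))
            total += 0.5 * (1.0 + math.erf(z))
        return total / len(values)

    lo, hi = kde_cdf(min(values)), kde_cdf(max(values))
    if hi == lo:  # all values identical
        return [0.0] * len(values)
    return [(kde_cdf(x) - lo) / (hi - lo) for x in values]


data = [0.0, 1.0, 2.0, 10.0]
print(kd_integral_transform(data, bandwidth=1e-6))  # close to a rank/quantile transform
print(kd_integral_transform(data, bandwidth=1e6))   # close to min-max scaling
print(kd_integral_transform(data, bandwidth=1.0))   # in between
```

With a very small bandwidth each kernel is nearly a step function, so the output depends only on ranks; with a very large bandwidth the CDF is nearly linear over the data range, so the output approaches min-max scaling.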
    Microscaling Data Formats for Deep Learning. (arXiv:2310.10537v3 [cs.LG] UPDATED)
    Narrow bit-width data formats are key to reducing the computational and storage costs of modern deep learning applications. This paper evaluates Microscaling (MX) data formats that combine a per-block scaling factor with narrow floating-point and integer types for individual elements. MX formats balance the competing needs of hardware efficiency, model accuracy, and user friction. Empirical results on over two dozen benchmarks demonstrate practicality of MX data formats as a drop-in replacement for baseline FP32 for AI inference and training with low user friction. We also show the first instance of training generative language models at sub-8-bit weights, activations, and gradients with minimal accuracy loss and no modifications to the training recipe.
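The idea of pairing a per-block scaling factor with narrow element types can be sketched with a simple block-wise integer quantizer. The block size of 32, the power-of-two shared scale, the 8-bit signed integer elements, and the names `quantize_blocks` and `dequantize_blocks` are assumptions chosen for illustration; this is not an implementation of the MX formats themselves.

```python
import math


def quantize_blocks(values, block_size=32, elem_bits=8):
    """Quantize a list of floats into (shared_exponent, int_elements) blocks.

    Each block shares one power-of-two scale 2**exp, chosen as the smallest
    such scale for which the block's largest magnitude fits the signed
    integer range. Elements are stored as narrow signed integers.
    """
    qmax = 2 ** (elem_bits - 1) - 1
    blocks = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        amax = max(abs(v) for v in block)
        exp = math.ceil(math.log2(amax / qmax)) if amax > 0.0 else 0
        scale = 2.0 ** exp
        # Round to the nearest integer and clamp to the representable range.
        ints = [max(-qmax, min(qmax, round(v / scale))) for v in block]
        blocks.append((exp, ints))
    return blocks


def dequantize_blocks(blocks):
    """Reconstruct approximate floats from (shared_exponent, int_elements) blocks."""
    return [q * 2.0 ** exp for exp, ints in blocks for q in ints]


values = [0.1 * i - 1.5 for i in range(64)]
blocks = quantize_blocks(values)
restored = dequantize_blocks(blocks)
print(max(abs(a - b) for a, b in zip(values, restored)))
```

Because the scale is chosen per block, a block of small values gets a finer step size than a block of large values, which is what lets narrow elements cover a wide dynamic range across a tensor.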
    Improving Generalization of Alignment with Human Preferences through Group Invariant Learning. (arXiv:2310.11971v2 [cs.LG] UPDATED)
    The success of AI assistants based on language models (LLMs) hinges crucially on Reinforcement Learning from Human Feedback (RLHF), which enables the generation of responses more aligned with human preferences. As universal AI assistants, there's a growing expectation for them to perform consistently across various domains. However, previous work shows that Reinforcement Learning (RL) often exploits shortcuts to attain high rewards and overlooks challenging samples. This focus on quick reward gains undermines both the stability in training and the model's ability to generalize to new, unseen data. In this work, we propose a novel approach that can learn a consistent policy via RL across various data groups or domains. Given the challenges associated with acquiring group annotations, our method automatically classifies data into different groups, deliberately maximizing performance variance. Then, we optimize the policy to perform well on challenging groups. Lastly, leveraging the established groups, our approach adaptively adjusts the exploration space, allocating more learning capacity to more challenging data and preventing the model from over-optimizing on simpler data. Experimental results indicate that our approach significantly enhances training stability and model generalization.
    When Rigidity Hurts: Soft Consistency Regularization for Probabilistic Hierarchical Time Series Forecasting. (arXiv:2310.11569v2 [cs.LG] UPDATED)
    Probabilistic hierarchical time-series forecasting is an important variant of time-series forecasting, where the goal is to model and forecast multivariate time-series that have underlying hierarchical relations. Most methods focus on point predictions and do not provide well-calibrated probabilistic forecast distributions. Recent state-of-the-art probabilistic forecasting methods also impose hierarchical relations on point predictions and samples of the distribution, which does not account for coherency of forecast distributions. Previous works also silently assume that datasets are always consistent with given hierarchical relations and do not adapt to real-world datasets that show deviation from this assumption. We close both these gaps and propose PROFHiT, which is a fully probabilistic hierarchical forecasting model that jointly models the forecast distribution of the entire hierarchy. PROFHiT uses a flexible probabilistic Bayesian approach and introduces a novel Distributional Coherency regularization to learn from hierarchical relations for the entire forecast distribution, which enables robust and calibrated forecasts as well as adaptation to datasets of varying hierarchical consistency. On evaluating PROFHiT over a wide range of datasets, we observed 41-88% better performance in accuracy and significantly better calibration. Due to modeling the coherency over the full distribution, we observed that PROFHiT can robustly provide reliable forecasts even if up to 10% of input time-series data is missing, where other methods' performance severely degrades by over 70%.
    Fed-GraB: Federated Long-tailed Learning with Self-Adjusting Gradient Balancer. (arXiv:2310.07587v2 [cs.LG] UPDATED)
    Data privacy and long-tailed distribution are the norms rather than the exception in many real-world tasks. This paper investigates a federated long-tailed learning (Fed-LT) task in which each client holds a locally heterogeneous dataset; if the datasets can be globally aggregated, they jointly exhibit a long-tailed distribution. Under such a setting, existing federated optimization and/or centralized long-tailed learning methods hardly apply due to challenges in (a) characterizing the global long-tailed distribution under privacy constraints and (b) adjusting the local learning strategy to cope with the head-tail imbalance. In response, we propose a method termed $\texttt{Fed-GraB}$, comprised of a Self-adjusting Gradient Balancer (SGB) module that re-weights clients' gradients in a closed-loop manner, based on the feedback of global long-tailed distribution evaluated by a Direct Prior Analyzer (DPA) module. Using $\texttt{Fed-GraB}$, clients can effectively alleviate the distribution drift caused by data heterogeneity during the model training process and obtain a global model with better performance on the minority classes while maintaining the performance of the majority classes. Extensive experiments demonstrate that $\texttt{Fed-GraB}$ achieves state-of-the-art performance on representative datasets such as CIFAR-10-LT, CIFAR-100-LT, ImageNet-LT, and iNaturalist.
    ACES: Generating Diverse Programming Puzzles with Autotelic Language Models and Semantic Descriptors. (arXiv:2310.10692v2 [cs.LG] UPDATED)
    Finding and selecting new and interesting problems to solve is at the heart of curiosity, science and innovation. We here study automated problem generation in the context of the open-ended space of Python programming puzzles. Existing generative models often aim at modeling a reference distribution without any explicit diversity optimization. Other methods explicitly optimizing for diversity do so either in limited hand-coded representation spaces or in uninterpretable learned embedding spaces that may not align with human perceptions of interesting variations. With ACES (Autotelic Code Exploration via Semantic descriptors), we introduce a new autotelic generation method that leverages semantic descriptors produced by a large language model (LLM) to directly optimize for interesting diversity, as well as few-shot-based generation. Each puzzle is labeled along 10 dimensions, each capturing a programming skill required to solve it. ACES generates and pursues novel and feasible goals to explore that abstract semantic space, slowly discovering a diversity of solvable programming puzzles in any given run. Across a set of experiments, we show that ACES discovers a richer diversity of puzzles than existing diversity-maximizing algorithms as measured across a range of diversity metrics. We further study whether and in which conditions this diversity can translate into the successful training of puzzle solving models.
    URL: A Representation Learning Benchmark for Transferable Uncertainty Estimates. (arXiv:2307.03810v2 [cs.LG] UPDATED)
    Representation learning has significantly driven the field to develop pretrained models that can act as a valuable starting point when transferring to new datasets. With the rising demand for reliable machine learning and uncertainty quantification, there is a need for pretrained models that not only provide embeddings but also transferable uncertainty estimates. To guide the development of such models, we propose the Uncertainty-aware Representation Learning (URL) benchmark. Besides the transferability of the representations, it also measures the zero-shot transferability of the uncertainty estimate using a novel metric. We apply URL to evaluate eleven uncertainty quantifiers that are pretrained on ImageNet and transferred to eight downstream datasets. We find that approaches that focus on the uncertainty of the representation itself or estimate the prediction risk directly outperform those that are based on the probabilities of upstream classes. Yet, achieving transferable uncertainty quantification remains an open challenge. Our findings indicate that it is not necessarily in conflict with traditional representation learning goals. Code is provided under https://github.com/mkirchhof/url .
    On the power of graph neural networks and the role of the activation function. (arXiv:2307.04661v2 [cs.LG] UPDATED)
    In this article we present new results about the expressivity of Graph Neural Networks (GNNs). We prove that for any GNN with piecewise polynomial activations, whose architecture size does not grow with the graph input sizes, there exists a pair of non-isomorphic rooted trees of depth two such that the GNN cannot distinguish their root vertex up to an arbitrary number of iterations. The proof relies on tools from the algebra of symmetric polynomials. In contrast, it was already known that unbounded GNNs (those whose size is allowed to change with the graph sizes) with piecewise polynomial activations can distinguish these vertices in only two iterations. Our results imply a strict separation between bounded and unbounded size GNNs, answering an open question formulated by [Grohe, 2021]. We next prove that if one allows activations that are not piecewise polynomial, then in two iterations a single neuron perceptron can distinguish the root vertices of any pair of non-isomorphic trees of depth two (our results hold for activations like the sigmoid, hyperbolic tangent and others). This shows how the power of graph neural networks can change drastically if one changes the activation function of the neural networks. The proof of this result utilizes the Lindemann-Weierstrass theorem from transcendental number theory.
    In-Context Pretraining: Language Modeling Beyond Document Boundaries. (arXiv:2310.10638v2 [cs.CL] UPDATED)
    Large language models (LMs) are currently trained to predict tokens given document prefixes, enabling them to directly perform long-form generation and prompting-style tasks which can be reduced to document completion. Existing pretraining pipelines train LMs by concatenating random sets of short documents to create input contexts, but the prior documents provide no signal for predicting the next document. We instead present In-Context Pretraining, a new approach where language models are pretrained on a sequence of related documents, thereby explicitly encouraging them to read and reason across document boundaries. We can do In-Context Pretraining by simply changing the document ordering so that each context contains related documents, and directly applying existing pretraining pipelines. However, this document sorting problem is challenging. There are billions of documents and we would like the sort to maximize contextual similarity for every document without repeating any data. To do this, we introduce approximate algorithms for finding related documents with efficient nearest neighbor search and constructing coherent input contexts with a graph traversal algorithm. Our experiments show In-Context Pretraining offers a simple and scalable approach to significantly enhance LMs' performance: we see notable improvements in tasks that require more complex contextual reasoning, including in-context learning (+8%), reading comprehension (+15%), faithfulness to previous contexts (+16%), long-context reasoning (+5%), and retrieval augmentation (+9%).
    HGCVAE: Integrating Generative and Contrastive Learning for Heterogeneous Graph Learning. (arXiv:2310.11102v3 [cs.LG] UPDATED)
    Generative self-supervised learning (SSL) has exhibited significant potential and garnered increasing interest in graph learning. In this study, we aim to explore the problem of generative SSL in the context of heterogeneous graph learning (HGL). The previous SSL approaches for heterogeneous graphs have primarily relied on contrastive learning, necessitating the design of complex views to capture heterogeneity. However, existing generative SSL methods have not fully leveraged the capabilities of generative models to address the challenges of HGL. In this paper, we present HGCVAE, a novel contrastive variational graph auto-encoder that liberates HGL from the burden of intricate heterogeneity capturing. Instead of focusing on complicated heterogeneity, HGCVAE harnesses the full potential of generative SSL. HGCVAE innovatively consolidates contrastive learning with generative SSL, introducing several key innovations. Firstly, we employ a progressive mechanism to generate high-quality hard negative samples for contrastive learning, utilizing the power of variational inference. Additionally, we present a dynamic mask strategy to ensure effective and stable learning. Moreover, we propose an enhanced scaled cosine error as the criterion for better attribute reconstruction. As an initial step in combining generative and contrastive SSL, HGCVAE achieves remarkable results compared to various state-of-the-art baselines, confirming its superiority.
    Deep Probabilistic Movement Primitives with a Bayesian Aggregator. (arXiv:2307.05141v2 [cs.RO] UPDATED)
    Movement primitives are trainable parametric models that reproduce robotic movements starting from a limited set of demonstrations. Previous works proposed simple linear models that exhibited high sample efficiency and generalization power by allowing temporal modulation of movements (reproducing movements faster or slower), blending (merging two movements into one), via-point conditioning (constraining a movement to meet some particular via-points) and context conditioning (generation of movements based on an observed variable, e.g., position of an object). Other works have proposed neural network-based motor primitive models and demonstrated their capacity to perform tasks with some forms of input conditioning or time-modulation representations. However, no single unified deep motor primitive model capable of all the previous operations has been proposed, limiting the potential applications of neural motor primitives. This paper proposes a deep movement primitive architecture that encodes all the operations above and uses a Bayesian context aggregator that allows more sound context conditioning and blending. Our results demonstrate that our approach can scale to reproduce complex motions on a larger variety of input choices compared to baselines while maintaining the operations that linear movement primitives provide.
    CONFLATOR: Incorporating Switching Point based Rotatory Positional Encodings for Code-Mixed Language Modeling. (arXiv:2309.05270v2 [cs.CL] UPDATED)
    The mixing of two or more languages is called Code-Mixing (CM). CM is a social norm in multilingual societies. Neural Language Models (NLMs) like transformers have been effective on many NLP tasks. However, NLM for CM is an under-explored area. Though transformers are capable and powerful, they cannot always encode positional information since they are non-recurrent. Therefore, to enrich word information and incorporate positional information, positional encoding is defined. We hypothesize that Switching Points (SPs), i.e., junctions in the text where the language switches (L1 -> L2 or L2 -> L1), pose a challenge for CM Language Models (LMs), and hence give special emphasis to SPs in the modeling process. We experiment with several positional encoding mechanisms and show that rotatory positional encodings along with switching point information yield the best results. We introduce CONFLATOR: a neural language modeling approach for code-mixed languages. CONFLATOR tries to learn to emphasize switching points using smarter positional encoding, both at unigram and bigram levels. CONFLATOR outperforms the state-of-the-art on two tasks based on code-mixed Hindi and English (Hinglish): (i) sentiment analysis and (ii) machine translation.
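    As a rough illustration of the rotary positional encodings mentioned above, the sketch below implements the standard rotary scheme (rotating consecutive feature pairs by position-dependent angles) in plain Python. It is not the CONFLATOR switching-point variant, whose details the abstract does not give; the function names `rope` and `dot` and the frequency base of 10000 are assumptions made for this example.

```python
import math

def rope(x, position, base=10000.0):
    """Apply a standard rotary positional encoding to vector x.

    Assumption: this is the generic rotary scheme, not CONFLATOR's
    switching-point-aware variant. Each consecutive pair (x[i], x[i+1])
    is rotated by an angle proportional to `position`.
    """
    d = len(x)
    if d % 2 != 0:
        raise ValueError("feature dimension must be even")
    out = []
    for i in range(0, d, 2):
        # Pair-specific frequency: lower pairs rotate faster.
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        out.append(x[i] * c - x[i + 1] * s)
        out.append(x[i] * s + x[i + 1] * c)
    return out

def dot(a, b):
    """Plain dot product of two equal-length lists."""
    return sum(u * v for u, v in zip(a, b))

# Hypothetical query/key vectors for illustration only.
q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.25]
score_a = dot(rope(q, 3), rope(k, 1))  # positions 3 and 1, offset 2
score_b = dot(rope(q, 7), rope(k, 5))  # positions 7 and 5, offset 2
```

    The two scores agree up to floating-point error because the rotated dot product depends only on the relative offset between positions, which is the property that makes rotary encodings attractive for attention.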
    Evaluating Superhuman Models with Consistency Checks. (arXiv:2306.09983v3 [cs.LG] UPDATED)
    If machine learning models were to achieve superhuman abilities at various reasoning or decision-making tasks, how would we go about evaluating such models, given that humans would necessarily be poor proxies for ground truth? In this paper, we propose a framework for evaluating superhuman models via consistency checks. Our premise is that while the correctness of superhuman decisions may be impossible to evaluate, we can still surface mistakes if the model's decisions fail to satisfy certain logical, human-interpretable rules. We instantiate our framework on three tasks where correctness of decisions is hard to evaluate due to either superhuman model abilities, or to otherwise missing ground truth: evaluating chess positions, forecasting future events, and making legal judgments. We show that regardless of a model's (possibly superhuman) performance on these tasks, we can discover logical inconsistencies in decision making. For example: a chess engine assigning opposing valuations to semantically identical boards; GPT-4 forecasting that sports records will evolve non-monotonically over time; or an AI judge assigning bail to a defendant only after we add a felony to their criminal record.
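    One of the consistency checks described above, that a forecast sports record should evolve monotonically over time, can be sketched in a few lines. The function name `monotonicity_violations` and the forecast values are hypothetical and serve only to illustrate the idea; they are not taken from the paper.

```python
def monotonicity_violations(forecasts, direction="non_increasing", tol=0.0):
    """Return pairs of consecutive time points whose forecasts break monotonicity.

    forecasts: iterable of (time, value) pairs, in any order.
    direction: "non_increasing" (e.g. a sprint record can only fall or stay)
               or "non_decreasing" (e.g. a high-jump record).
    """
    if direction not in ("non_increasing", "non_decreasing"):
        raise ValueError("unknown direction: " + direction)
    ordered = sorted(forecasts)
    violations = []
    for (t0, v0), (t1, v1) in zip(ordered, ordered[1:]):
        if direction == "non_increasing" and v1 > v0 + tol:
            violations.append((t0, t1))
        elif direction == "non_decreasing" and v1 < v0 - tol:
            violations.append((t0, t1))
    return violations

# Hypothetical model forecasts of a sprint record (seconds) by year.
sprint_forecasts = [(2025, 9.58), (2030, 9.56), (2035, 9.60), (2040, 9.55)]
bad_pairs = monotonicity_violations(sprint_forecasts)
```

    Here the forecast for 2035 is slower than the one for 2030, which is logically impossible for a record, so the check flags that pair without needing to know the true future values.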
    Provable Guarantees for Neural Networks via Gradient Feature Learning. (arXiv:2310.12408v1 [cs.LG])
    Neural networks have achieved remarkable empirical performance, while the current theoretical analysis is not adequate for understanding their success, e.g., the Neural Tangent Kernel approach fails to capture their key feature learning ability, while recent analyses on feature learning are typically problem-specific. This work proposes a unified analysis framework for two-layer networks trained by gradient descent. The framework is centered around the principle of feature learning from gradients, and its effectiveness is demonstrated by applications in several prototypical problems, such as mixtures of Gaussians and parity functions. The framework also sheds light on interesting network learning phenomena such as feature learning beyond kernels and the lottery ticket hypothesis.
    Symmetric Neural-Collapse Representations with Supervised Contrastive Loss: The Impact of ReLU and Batching. (arXiv:2306.07960v2 [cs.LG] UPDATED)
    Supervised contrastive loss (SCL) is a competitive and often superior alternative to the cross-entropy loss for classification. While prior studies have demonstrated that both losses yield symmetric training representations under balanced data, this symmetry breaks under class imbalances. This paper presents an intriguing discovery: the introduction of a ReLU activation at the final layer effectively restores the symmetry in SCL-learned representations. We arrive at this finding analytically, by establishing that the global minimizers of an unconstrained features model with SCL loss and entry-wise non-negativity constraints form an orthogonal frame. Extensive experiments conducted across various datasets, architectures, and imbalance scenarios corroborate our finding. Importantly, our experiments reveal that the inclusion of the ReLU activation restores symmetry without compromising test accuracy. This constitutes the first geometry characterization of SCL under imbalances. Additionally, our analysis and experiments underscore the pivotal role of batch selection strategies in representation geometry. By proving necessary and sufficient conditions for mini-batch choices that ensure invariant symmetric representations, we introduce batch-binding as an efficient strategy that guarantees these conditions hold.
    Automatic Prompt Optimization with "Gradient Descent" and Beam Search. (arXiv:2305.03495v2 [cs.CL] UPDATED)
    Large Language Models (LLMs) have shown impressive performance as general purpose agents, but their abilities remain highly dependent on prompts which are hand written with onerous trial-and-error effort. We propose a simple and nonparametric solution to this problem, Automatic Prompt Optimization (APO), which is inspired by numerical gradient descent to automatically improve prompts, assuming access to training data and an LLM API. The algorithm uses minibatches of data to form natural language "gradients" that criticize the current prompt. The gradients are then "propagated" into the prompt by editing the prompt in the opposite semantic direction of the gradient. These gradient descent steps are guided by a beam search and bandit selection procedure which significantly improves algorithmic efficiency. Preliminary results across three benchmark NLP tasks and the novel problem of LLM jailbreak detection suggest that Automatic Prompt Optimization can outperform prior prompt editing techniques and improve an initial prompt's performance by up to 31%, by using data to rewrite vague task descriptions into more precise annotation instructions.
    On the Design Fundamentals of Diffusion Models: A Survey. (arXiv:2306.04542v3 [cs.LG] UPDATED)
    Diffusion models are generative models, which gradually add and remove noise to learn the underlying distribution of training data for data generation. The components of diffusion models have gained significant attention with many design choices proposed. Existing reviews have primarily focused on higher-level solutions, thereby covering less on the design fundamentals of components. This study seeks to address this gap by providing a comprehensive and coherent review on component-wise design choices in diffusion models. Specifically, we organize this review according to their three key components, namely the forward process, the reverse process, and the sampling procedure. This allows us to provide a fine-grained perspective of diffusion models, benefiting future studies in the analysis of individual components, the applicability of design choices, and the implementation of diffusion models.
    Make Pre-trained Model Reversible: From Parameter to Memory Efficient Fine-Tuning. (arXiv:2306.00477v4 [cs.CL] UPDATED)
    Parameter-efficient fine-tuning (PEFT) of pre-trained language models (PLMs) has emerged as a highly successful approach: it trains only a small number of parameters without sacrificing performance and has become the de-facto learning paradigm with the increasing size of PLMs. However, existing PEFT methods are not memory-efficient, because they still require caching most of the intermediate activations for the gradient calculation, akin to fine-tuning. One effective way to reduce the activation memory is to apply a reversible model, so the intermediate activations need not be cached and can be recomputed. Nevertheless, modifying a PLM to its reversible variant is not straightforward, since the reversible model has a distinct architecture from the currently released PLMs. In this paper, we first investigate what is a key factor for the success of existing PEFT methods, and find that it is essential to preserve the PLM's starting point when initializing a PEFT method. With this finding, we propose memory-efficient fine-tuning (MEFT) that inserts adapters into a PLM, preserving the PLM's starting point and making it reversible without additional pre-training. We evaluate MEFT on the GLUE benchmark and five question-answering tasks with various backbones, BERT, RoBERTa, BART and OPT. MEFT significantly reduces the activation memory up to 84% of full fine-tuning with a negligible amount of trainable parameters. Moreover, MEFT achieves the same score on GLUE and a comparable score on the question-answering tasks as full fine-tuning. A similar finding is also observed for the image classification task.
    ArtWhisperer: A Dataset for Characterizing Human-AI Interactions in Artistic Creations. (arXiv:2306.08141v2 [cs.AI] UPDATED)
    As generative AI becomes more prevalent, it is important to study how human users interact with such models. In this work, we investigate how people use text-to-image models to generate desired target images. To study this interaction, we created ArtWhisperer, an online game where users are given a target image and are tasked with iteratively finding a prompt that creates a similar-looking image as the target. Through this game, we recorded over 50,000 human-AI interactions; each interaction corresponds to one text prompt created by a user and the corresponding generated image. The majority of these are repeated interactions where a user iterates to find the best prompt for their target image, making this a unique sequential dataset for studying human-AI collaborations. In an initial analysis of this dataset, we identify several characteristics of prompt interactions and user strategies. People submit diverse prompts and are able to discover a variety of text descriptions that generate similar images. Interestingly, prompt diversity does not decrease as users find better prompts. We further propose a new metric to quantify the steerability of AI using our dataset. We define steerability as the expected number of interactions required to adequately complete a task. We estimate this value by fitting a Markov chain for each target task and calculating the expected time to reach an adequate score in the Markov chain. We quantify and compare AI steerability across different types of target images and two different models, finding that images of cities and natural world images are more steerable than artistic and fantasy images. These findings provide insights into human-AI interaction behavior, present a concrete method of assessing AI steerability, and demonstrate the general utility of the ArtWhisperer dataset.
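    The steerability estimate described above, the expected time to reach an adequate score in a fitted Markov chain, is an expected hitting time. A minimal sketch follows, assuming the chain's states are score bins and that the "adequate" state is absorbing; the function name `expected_steps_to_absorb` and the transition probabilities are hypothetical, and the paper's fitting procedure is not reproduced here.

```python
def expected_steps_to_absorb(Q):
    """Expected number of steps until the chain leaves the transient states.

    Q[i][j] is the probability of moving from transient state i to
    transient state j; the remaining mass 1 - sum(Q[i]) goes to the
    absorbing "adequate score" state. Solves (I - Q) t = 1 by
    Gauss-Jordan elimination with partial pivoting.
    """
    n = len(Q)
    # Augmented matrix [I - Q | 1].
    A = [[(1.0 if i == j else 0.0) - Q[i][j] for j in range(n)] + [1.0]
         for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("the adequate state is unreachable from some state")
        A[col], A[pivot] = A[pivot], A[col]
        pv = A[col][col]
        A[col] = [v / pv for v in A[col]]
        for r in range(n):
            if r != col and A[r][col] != 0.0:
                f = A[r][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][n] for i in range(n)]

# Hypothetical two-bin chain: state 0 = low score, state 1 = medium score.
Q = [[0.5, 0.3],
     [0.2, 0.4]]
steps = expected_steps_to_absorb(Q)
```

    Under these made-up probabilities a user starting in the low-score bin needs about 3.75 prompts on average to reach an adequate image; a smaller value would indicate a more steerable target.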
    Quasi Manhattan Wasserstein Distance. (arXiv:2310.12498v1 [cs.LG])
    The Quasi Manhattan Wasserstein Distance (QMWD) is a metric designed to quantify the dissimilarity between two matrices by combining elements of the Wasserstein Distance with specific transformations. It offers improved time and space complexity compared to the Manhattan Wasserstein Distance (MWD) while maintaining accuracy. QMWD is particularly advantageous for large datasets or situations with limited computational resources. This article provides a detailed explanation of QMWD, its computation, complexity analysis, and comparisons with WD and MWD.
    Detecting and Mitigating Algorithmic Bias in Binary Classification using Causal Modeling. (arXiv:2310.12421v1 [cs.LG])
    This paper proposes the use of causal modeling to detect and mitigate algorithmic bias. We provide a brief description of causal modeling and a general overview of our approach. We then use the Adult dataset, which is available for download from the UC Irvine Machine Learning Repository, to develop (1) a prediction model, which is treated as a black box, and (2) a causal model for bias mitigation. In this paper, we focus on gender bias and the problem of binary classification. We show that gender bias in the prediction model is statistically significant at the 0.05 level. We demonstrate the effectiveness of the causal model in mitigating gender bias by cross-validation. Furthermore, we show that the overall classification accuracy is improved slightly. Our novel approach is intuitive, easy-to-use, and can be implemented using existing statistical software tools such as "lavaan" in R. Hence, it enhances explainability and promotes trust.
    Online Resource Allocation in Episodic Markov Decision Processes. (arXiv:2305.10744v3 [cs.DS] UPDATED)
    This paper studies a long-term resource allocation problem over multiple periods where each period requires a multi-stage decision-making process. We formulate the problem as an online allocation problem in an episodic finite-horizon constrained Markov decision process with an unknown non-stationary transition function and stochastic non-stationary reward and resource consumption functions. We propose the observe-then-decide regime and improve the existing decide-then-observe regime, while the two settings differ in how the observations and feedback about the reward and resource consumption functions are given to the decision-maker. We develop an online dual mirror descent algorithm that achieves near-optimal regret bounds for both settings. For the observe-then-decide regime, we prove that the expected regret against the dynamic clairvoyant optimal policy is bounded by $\tilde O(\rho^{-1}{H^{3/2}}S\sqrt{AT})$ where $\rho\in(0,1)$ is the budget parameter, $H$ is the length of the horizon, $S$ and $A$ are the numbers of states and actions, and $T$ is the number of episodes. For the decide-then-observe regime, we show that the regret against the static optimal policy that has access to the mean reward and mean resource consumption functions is bounded by $\tilde O(\rho^{-1}{H^{3/2}}S\sqrt{AT})$ with high probability. We test the numerical efficiency of our method for a variant of the resource-constrained inventory management problem.
    Kepler: Robust Learning for Faster Parametric Query Optimization. (arXiv:2306.06798v2 [cs.DB] UPDATED)
    Most existing parametric query optimization (PQO) techniques rely on traditional query optimizer cost models, which are often inaccurate and result in suboptimal query performance. We propose Kepler, an end-to-end learning-based approach to PQO that demonstrates significant speedups in query latency over a traditional query optimizer. Central to our method is Row Count Evolution (RCE), a novel plan generation algorithm based on perturbations in the sub-plan cardinality space. While previous approaches require accurate cost models, we bypass this requirement by evaluating candidate plans via actual execution data and training an ML model to predict the fastest plan given parameter binding values. Our models leverage recent advances in neural network uncertainty in order to robustly predict faster plans while avoiding regressions in query performance. Experimentally, we show that Kepler achieves significant improvements in query runtime on multiple datasets on PostgreSQL.
    Connecting Multi-modal Contrastive Representations. (arXiv:2305.14381v2 [cs.LG] UPDATED)
    Multi-modal Contrastive Representation (MCR) learning aims to encode different modalities into a semantically aligned shared space. This paradigm shows remarkable generalization ability on numerous downstream tasks across various modalities. However, the reliance on massive high-quality data pairs limits its further development on more modalities. This paper proposes a novel training-efficient method for learning MCR without paired data called Connecting Multi-modal Contrastive Representations (C-MCR). Specifically, given two existing MCRs pre-trained on (A, B) and (B, C) modality pairs, we project them to a new space and use the data from the overlapping modality B to align the two MCRs in the new space. Meanwhile, since the modality pairs (A, B) and (B, C) are already aligned within each MCR, the connection learned by the overlapping modality can also be transferred to the non-overlapping modality pair (A, C). To unleash the potential of C-MCR, we further introduce a semantic-enhanced inter- and intra-MCR connection method. We first enhance the semantic consistency and completion of embeddings across different modalities for more robust alignment. Then we utilize the inter-MCR alignment to establish the connection, and employ the intra-MCR alignment to better maintain the connection for inputs from non-overlapping modalities. To demonstrate the effectiveness of C-MCR, we connect CLIP and CLAP via texts to derive audio-visual representations, and integrate CLIP and ULIP via images for 3D-language representations. Remarkably, without using any paired data, C-MCR for audio-visual achieves state-of-the-art performance on audio-image retrieval, audio-visual source localization, and counterfactual audio-image recognition tasks. Furthermore, C-MCR for 3D-language also attains advanced zero-shot 3D point cloud classification accuracy on ModelNet40.
    Seeing double with a multifunctional reservoir computer. (arXiv:2305.05799v2 [math.DS] UPDATED)
    Multifunctional biological neural networks exploit multistability in order to perform multiple tasks without changing any network properties. Enabling artificial neural networks (ANNs) to obtain certain multistabilities in order to perform several tasks, where each task is related to a particular attractor in the network's state space, naturally has many benefits from a machine learning perspective. Given the association to multistability, in this paper we explore how the relationship between different attractors influences the ability of a reservoir computer (RC), which is a dynamical system in the form of an ANN, to achieve multifunctionality. We construct the 'seeing double' problem to systematically study how an RC reconstructs a coexistence of attractors when there is an overlap between them. As the amount of overlap increases, we discover that for multifunctionality to occur, there is a critical dependence on a suitable choice of the spectral radius for the RC's internal network connections. A bifurcation analysis reveals how multifunctionality emerges and is destroyed as the RC enters a chaotic regime that can lead to chaotic itinerancy.
    The Adaptive $\tau$-Lasso: Robustness and Oracle Properties. (arXiv:2304.09310v2 [stat.ML] UPDATED)
    This paper introduces a new regularized version of the robust $\tau$-regression estimator for analyzing high-dimensional datasets subject to gross contamination in the response variables and covariates (explanatory variables). The resulting estimator, termed adaptive $\tau$-Lasso, is robust to outliers and high-leverage points. It also incorporates an adaptive $\ell_1$-norm penalty term, which enables the selection of relevant variables and reduces the bias associated with large true regression coefficients. More specifically, this adaptive $\ell_1$-norm penalty term assigns a weight to each regression coefficient. For a fixed number of predictors $p$, we show that the adaptive $\tau$-Lasso has the oracle property, ensuring both variable-selection consistency and asymptotic normality. Asymptotic normality applies only to the entries of the regression vector corresponding to the true support, assuming knowledge of the true regression vector support. We characterize its robustness via the finite-sample breakdown point and the influence function. We carry out extensive simulations and observe that the class of $\tau$-Lasso estimators exhibits robustness and reliable performance in both contaminated and uncontaminated data settings. We also validate our theoretical findings on robustness properties through simulation experiments. In the face of outliers and high-leverage points, the adaptive $\tau$-Lasso and $\tau$-Lasso estimators achieve the best performance or close-to-best performance in terms of prediction and variable selection accuracy compared to other competing regularized estimators for all scenarios considered in this study. Therefore, the adaptive $\tau$-Lasso and $\tau$-Lasso estimators can be effectively employed for a variety of sparse linear regression problems, particularly in high-dimensional settings and when the data is contaminated by outliers and high-leverage points.
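    To make the role of the adaptive $\ell_1$-norm penalty concrete, a generic adaptively penalized objective can be written as below. This is a sketch under assumptions: $L_n(\beta)$ stands in for the robust $\tau$-scale loss, $\tilde\beta$ is a hypothetical pilot estimate, and the weight form $w_j = 1/|\tilde\beta_j|^{\gamma}$ is the one commonly used for the adaptive lasso; the paper's exact loss and weights may differ.

$$
\hat\beta \;=\; \arg\min_{\beta \in \mathbb{R}^p} \Big\{ L_n(\beta) + \lambda_n \sum_{j=1}^{p} w_j \,|\beta_j| \Big\},
\qquad w_j = \frac{1}{|\tilde\beta_j|^{\gamma}}, \quad \gamma > 0 .
$$

    Coefficients with a large pilot estimate receive a small weight and are therefore shrunk less, which is how a per-coefficient weight can reduce the bias associated with large true regression coefficients while still driving irrelevant coefficients to zero.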
    PGA: Personalizing Grasping Agents with Single Human-Robot Interaction. (arXiv:2310.12547v1 [cs.RO])
    Language-Conditioned Robotic Grasping (LCRG) aims to develop robots that ground and grasp objects based on natural language instructions. While robots capable of recognizing personal objects like "my wallet" can interact more naturally with non-expert users, current LCRG systems primarily limit robots to understanding only generic expressions. To this end, we introduce a task scenario GraspMine with a novel dataset that aims to locate and grasp personal objects given personal indicators via learning from a single human-robot interaction. To address GraspMine, we propose Personalized Grasping Agent (PGA), which learns personal objects by propagating user-given information through a Reminiscence, a collection of raw images from the user's environment. Specifically, PGA acquires personal object information by a user presenting a personal object with its associated indicator, followed by PGA inspecting the object by rotating it. Based on the acquired information, PGA pseudo-labels objects in the Reminiscence by our proposed label propagation algorithm. Harnessing the information acquired from the interactions and the pseudo-labeled objects in the Reminiscence, PGA adapts the object grounding model to grasp personal objects. Experiments on GraspMine show that PGA significantly outperforms baseline methods in both offline and online settings, signifying its effectiveness and personalization applicability to real-world scenarios. Finally, qualitative analysis shows the effectiveness of PGA through a detailed investigation of results in each phase.
    STANLEY: Stochastic Gradient Anisotropic Langevin Dynamics for Learning Energy-Based Models. (arXiv:2310.12667v1 [stat.ML])
    In this paper we propose STANLEY, a STochastic gradient ANisotropic LangEvin dYnamics, for sampling high-dimensional data. With the growing efficacy and potential of Energy-Based modeling, also known as non-normalized probabilistic modeling, for modeling generative processes of different kinds of high-dimensional data observations, we present an end-to-end learning algorithm for Energy-Based models (EBMs) with the purpose of improving the quality of the resulting sampled data points. While the unknown normalizing constant of EBMs makes the training procedure intractable, resorting to Markov Chain Monte Carlo (MCMC) is in general a viable option. Realizing what MCMC entails for EBM training, we propose a novel high-dimensional sampling method based on an anisotropic stepsize and a gradient-informed covariance matrix, embedded into a discretized Langevin diffusion. We motivate the necessity for an anisotropic update of the negative samples in the Markov Chain by the nonlinearity of the backbone of the EBM, here a Convolutional Neural Network. Our resulting method, namely STANLEY, is an optimization algorithm for training Energy-Based models via our newly introduced MCMC method. We provide a theoretical understanding of our sampling scheme by proving that the sampler leads to a geometrically uniformly ergodic Markov Chain. Several image generation experiments are provided in our paper to show the effectiveness of our method.
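    As a point of reference for the discretized Langevin diffusion mentioned above, here is a minimal sketch of a plain, isotropic unadjusted Langevin update. It is not the anisotropic STANLEY sampler: the toy energy E(x) = x^2 / 2, the scalar step size, and all function names are assumptions chosen only for illustration.

```python
import math
import random


def grad_energy(x):
    """Gradient of the toy energy E(x) = x**2 / 2 (a standard Gaussian target)."""
    return x


def langevin_chain(x0, step, n_steps, rng):
    """Unadjusted Langevin: x <- x - step * grad E(x) + sqrt(2 * step) * noise."""
    x = x0
    samples = []
    for _ in range(n_steps):
        noise = rng.gauss(0.0, 1.0)
        x = x - step * grad_energy(x) + math.sqrt(2.0 * step) * noise
        samples.append(x)
    return samples


rng = random.Random(0)
samples = langevin_chain(x0=5.0, step=0.05, n_steps=20000, rng=rng)
kept = samples[2000:]  # discard the burn-in part of the chain
mean = sum(kept) / len(kept)
var = sum((s - mean) ** 2 for s in kept) / len(kept)
```

    With a single scalar step size every coordinate moves at the same scale; an anisotropic scheme would, roughly speaking, replace that scalar with a direction-dependent stepsize or covariance.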
    One-shot Empirical Privacy Estimation for Federated Learning. (arXiv:2302.03098v4 [cs.LG] UPDATED)
    Privacy estimation techniques for differentially private (DP) algorithms are useful for comparing against analytical bounds, or to empirically measure privacy loss in settings where known analytical bounds are not tight. However, existing privacy auditing techniques usually make strong assumptions on the adversary (e.g., knowledge of intermediate model iterates or the training data distribution), are tailored to specific tasks, model architectures, or DP algorithms, and/or require retraining the model many times (typically on the order of thousands). These shortcomings make deploying such techniques at scale difficult in practice, especially in federated settings where model training can take days or weeks. In this work, we present a novel "one-shot" approach that can systematically address these challenges, allowing efficient auditing or estimation of the privacy loss of a model during the same, single training run used to fit model parameters, and without requiring any a priori knowledge about the model architecture, task, or DP training algorithm. We show that our method provides provably correct estimates for the privacy loss under the Gaussian mechanism, and we demonstrate its performance on well-established FL benchmark datasets under several adversarial threat models.
    Topic-Level Bayesian Surprise and Serendipity for Recommender Systems. (arXiv:2308.06368v2 [cs.IR] UPDATED)
    A recommender system that optimizes its recommendations solely to fit a user's history of ratings for consumed items can create a filter bubble, wherein the user does not get to experience items from novel, unseen categories. One approach to mitigate this undesired behavior is to recommend items with high potential for serendipity, namely surprising items that are likely to be highly rated. In this paper, we propose a content-based formulation of serendipity that is rooted in Bayesian surprise and use it to measure the serendipity of items after they are consumed and rated by the user. When coupled with a collaborative-filtering component that identifies similar users, this enables recommending items with high potential for serendipity. To facilitate the evaluation of topic-level models for surprise and serendipity, we introduce a dataset of book reading histories extracted from Goodreads, containing over 26 thousand users and close to 1.3 million books, where we manually annotate 449 books read by 4 users in terms of their time-dependent, topic-level surprise. Experimental evaluations show that models that use Bayesian surprise correlate much better with the manual annotations of topic-level surprise than distance-based heuristics, and also obtain better serendipitous item recommendation performance.
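    To illustrate the notion of Bayesian surprise used above, here is a minimal sketch that measures surprise as the KL divergence from a prior belief to the updated (posterior) belief. The three-topic distributions and the function name are hypothetical, and the paper's topic-level formulation may differ in its details.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats for discrete distributions on the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


# Hypothetical user preferences over three topics (each list sums to 1).
prior = [0.7, 0.2, 0.1]
posterior_familiar = [0.72, 0.19, 0.09]  # after an item from a dominant topic
posterior_novel = [0.5, 0.15, 0.35]      # after an item from a rarely seen topic

surprise_familiar = kl_divergence(posterior_familiar, prior)
surprise_novel = kl_divergence(posterior_novel, prior)
```

    In this toy setup the item that shifts belief toward a rarely seen topic yields the larger divergence, which matches the intuition that it is the more surprising one.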
    Generative Pretrained Autoregressive Transformer Graph Neural Network applied to the Analysis and Discovery of Novel Proteins. (arXiv:2305.04934v2 [q-bio.BM] CROSS LISTED)
    We report a flexible language-model based deep learning strategy, applied here to solve complex forward and inverse problems in protein modeling, based on an attention neural network that integrates transformer and graph convolutional architectures in a causal multi-headed graph mechanism, to realize a generative pretrained model. The model is applied to predict secondary structure content (per-residue level and overall content), protein solubility, and sequencing tasks. Further trained on inverse tasks, the model is rendered capable of designing proteins with these properties as target features. The model is formulated as a general framework, completely prompt-based, and can be adapted for a variety of downstream tasks. We find that adding additional tasks yields emergent synergies that the model exploits in improving overall performance, beyond what would be possible by training a model on each dataset alone. Case studies are presented to validate the method, yielding protein designs specifically focused on structural proteins, but also exploring the applicability in the design of soluble, antimicrobial biomaterials. While our model is trained to ultimately perform 8 distinct tasks, with available datasets it can be extended to solve additional problems. In a broader sense, this work illustrates a form of multiscale modeling that relates a set of ultimate building blocks (here, byte-level utf8 characters that define the nature of the physical system at hand) to complex output. This materiomic scheme captures complex emergent relationships between universal building block and resulting properties via a synergizing learning capacity to express a set of potentialities embedded in the knowledge used in training, via the interplay of universality and diversity.
    AdANNS: A Framework for Adaptive Semantic Search. (arXiv:2305.19435v2 [cs.LG] UPDATED)
    Web-scale search systems learn an encoder to embed a given query which is then hooked into an approximate nearest neighbor search (ANNS) pipeline to retrieve similar data points. To accurately capture tail queries and data points, learned representations typically are rigid, high-dimensional vectors that are generally used as-is in the entire ANNS pipeline and can lead to computationally expensive retrieval. In this paper, we argue that instead of rigid representations, different stages of ANNS can leverage adaptive representations of varying capacities to achieve significantly better accuracy-compute trade-offs, i.e., stages of ANNS that can get away with more approximate computation should use a lower-capacity representation of the same data point. To this end, we introduce AdANNS, a novel ANNS design framework that explicitly leverages the flexibility of Matryoshka Representations. We demonstrate state-of-the-art accuracy-compute trade-offs using novel AdANNS-based key ANNS building blocks like search data structures (AdANNS-IVF) and quantization (AdANNS-OPQ). For example on ImageNet retrieval, AdANNS-IVF is up to 1.5% more accurate than the rigid representations-based IVF at the same compute budget; and matches accuracy while being up to 90x faster in wall-clock time. For Natural Questions, 32-byte AdANNS-OPQ matches the accuracy of the 64-byte OPQ baseline constructed using rigid representations -- same accuracy at half the cost! We further show that the gains from AdANNS translate to modern-day composite ANNS indices that combine search structures and quantization. Finally, we demonstrate that AdANNS can enable inference-time adaptivity for compute-aware search on ANNS indices built non-adaptively on matryoshka representations. Code is open-sourced at https://github.com/RAIVNLab/AdANNS.
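    The idea of using a lower-capacity representation for the cheaper stage can be sketched with a brute-force toy example: shortlist candidates with only the first few coordinates of each embedding, then rerank the shortlist with the full vector. This is not the AdANNS-IVF or AdANNS-OPQ implementation; the function names, the 4-dimensional vectors, and the assumption that a prefix of the embedding is itself a usable representation (the Matryoshka property) are illustrative.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


def two_stage_search(query, database, low_dim, shortlist_size):
    """Shortlist with the first `low_dim` coordinates, then rerank with all of them."""
    coarse = sorted(
        range(len(database)),
        key=lambda i: cosine(query[:low_dim], database[i][:low_dim]),
        reverse=True,
    )[:shortlist_size]
    return max(coarse, key=lambda i: cosine(query, database[i]))


# Hypothetical 4-dimensional embeddings.
database = [
    [1.0, 0.0, 0.0, 1.0],
    [1.0, 0.1, 0.9, 0.0],
    [0.0, 1.0, 0.0, 0.0],
]
query = [1.0, 0.0, 1.0, 0.0]
best = two_stage_search(query, database, low_dim=2, shortlist_size=2)
```

    Here the 2-dimensional prefix alone would rank item 0 first, while the full-dimensional rerank of the shortlist selects item 1, so the expensive comparison is spent only on the few candidates that survive the cheap stage.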
    A path-norm toolkit for modern networks: consequences, promises and challenges. (arXiv:2310.01225v2 [stat.ML] UPDATED)
    This work introduces the first toolkit around path-norms that is fully able to encompass general DAG ReLU networks with biases, skip connections and any operation based on the extraction of order statistics: max pooling, GroupSort etc. This toolkit notably allows us to establish generalization bounds for modern neural networks that are not only the most widely applicable path-norm based ones, but also recover or beat the sharpest known bounds of this type. These extended path-norms further enjoy the usual benefits of path-norms: ease of computation, invariance under the symmetries of the network, and improved sharpness on feedforward networks compared to the product of operators' norms, another complexity measure most commonly used. The versatility of the toolkit and its ease of implementation allow us to challenge the concrete promises of path-norm-based generalization bounds, by numerically evaluating the sharpest known bounds for ResNets on ImageNet.
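    For intuition about the invariance property mentioned above, here is a minimal sketch of the L1 path-norm in the simplest case only: a bias-free two-layer ReLU network with one output. The toolkit described in the abstract covers far more general architectures; the weights and function names below are assumptions for illustration.

```python
def path_norm_l1(W, v):
    """Sum over input->hidden->output paths of the product of absolute weights."""
    return sum(abs(v_j) * sum(abs(w) for w in row) for row, v_j in zip(W, v))


def rescale_neuron(W, v, j, c):
    """Scale neuron j's incoming weights by c > 0 and its outgoing weight by 1 / c."""
    W2 = [list(row) for row in W]
    v2 = list(v)
    W2[j] = [c * w for w in W2[j]]
    v2[j] = v2[j] / c
    return W2, v2


W = [[1.0, -2.0], [0.5, 3.0]]  # hidden-by-input weight matrix
v = [2.0, -1.0]                # hidden-to-output weights
base = path_norm_l1(W, v)      # 2 * 3 + 1 * 3.5 = 9.5
W_scaled, v_scaled = rescale_neuron(W, v, j=0, c=4.0)
rescaled = path_norm_l1(W_scaled, v_scaled)
```

    Because ReLU is positively homogeneous, this rescaling leaves the network function unchanged, and the path-norm is unchanged as well, whereas a product of per-layer norms generally is not.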
    Understanding Multi-phase Optimization Dynamics and Rich Nonlinear Behaviors of ReLU Networks. (arXiv:2305.12467v3 [cs.LG] UPDATED)
    The training process of ReLU neural networks often exhibits complicated nonlinear phenomena. The nonlinearity of models and non-convexity of loss pose significant challenges for theoretical analysis. Therefore, most previous theoretical works on the optimization dynamics of neural networks focus either on local analysis (like the end of training) or approximate linear models (like Neural Tangent Kernel). In this work, we conduct a complete theoretical characterization of the training process of a two-layer ReLU network trained by Gradient Flow on linearly separable data. In this specific setting, our analysis captures the whole optimization process starting from random initialization to final convergence. Despite the relatively simple model and data that we studied, we reveal four different phases from the whole training process showing a general simplifying-to-complicating learning trend. Specific nonlinear behaviors can also be precisely identified and captured theoretically, such as initial condensation, saddle-to-plateau dynamics, plateau escape, changes of activation patterns, learning with increasing complexity, etc.
    An Introduction to Transformers. (arXiv:2304.10557v4 [cs.LG] UPDATED)
    The transformer is a neural network component that can be used to learn useful representations of sequences or sets of data-points. The transformer has driven recent advances in natural language processing, computer vision, and spatio-temporal modelling. There are many introductions to transformers, but most do not contain precise mathematical descriptions of the architecture and the intuitions behind the design choices are often also missing. Moreover, as research takes a winding path, the explanations for the components of the transformer can be idiosyncratic. In this note we aim for a mathematically precise, intuitive, and clean description of the transformer architecture. We will not discuss training as this is rather standard. We assume that the reader is familiar with fundamental topics in machine learning including multi-layer perceptrons, linear transformations, softmax functions and basic probability.
    Relational Self-Supervised Learning. (arXiv:2203.08717v2 [cs.CV] UPDATED)
    Self-supervised Learning (SSL), including the mainstream contrastive learning, has achieved great success in learning visual representations without data annotations. However, most methods mainly focus on instance-level information (i.e., the different augmented images of the same instance should have the same feature or cluster into the same class), and pay little attention to the relationships between different instances. In this paper, we introduce a novel SSL paradigm, which we term the relational self-supervised learning (ReSSL) framework, that learns representations by modeling the relationship between different instances. Specifically, our proposed method employs a sharpened distribution of pairwise similarities among different instances as the relation metric, which is then used to match the feature embeddings of different augmentations. To boost the performance, we argue that weak augmentations matter to represent a more reliable relation, and leverage a momentum strategy for practical efficiency. The designed asymmetric predictor head and an InfoNCE warm-up strategy enhance the robustness to hyper-parameters and benefit the resulting performance. Experimental results show that our proposed ReSSL substantially outperforms the state-of-the-art methods across different network architectures, including various lightweight networks (e.g., EfficientNet and MobileNet).
    EDGI: Equivariant Diffusion for Planning with Embodied Agents. (arXiv:2303.12410v2 [cs.LG] UPDATED)
    Embodied agents operate in a structured world, often solving tasks with spatial, temporal, and permutation symmetries. Most algorithms for planning and model-based reinforcement learning (MBRL) do not take this rich geometric structure into account, leading to sample inefficiency and poor generalization. We introduce the Equivariant Diffuser for Generating Interactions (EDGI), an algorithm for MBRL and planning that is equivariant with respect to the product of the spatial symmetry group $\mathrm{SE}(3)$, the discrete-time translation group $\mathbb{Z}$, and the object permutation group $S_n$. EDGI follows the Diffuser framework (Janner et al., 2022) in treating both learning a world model and planning in it as a conditional generative modeling problem, training a diffusion model on an offline trajectory dataset. We introduce a new $\mathrm{SE}(3) \times \mathbb{Z} \times S_n$-equivariant diffusion model that supports multiple representations. We integrate this model in a planning loop, where conditioning and classifier guidance let us softly break the symmetry for specific tasks as needed. On object manipulation and navigation tasks, EDGI is substantially more sample efficient and generalizes better across the symmetry group than non-equivariant models.
    Data Augmentation for Time-Series Classification: An Extensive Empirical Study and Comprehensive Survey. (arXiv:2310.10060v2 [cs.LG] UPDATED)
    Data Augmentation (DA) has emerged as an indispensable strategy in Time Series Classification (TSC), primarily due to its capacity to amplify training samples, thereby bolstering model robustness, diversifying datasets, and curtailing overfitting. However, the current landscape of DA in TSC is plagued with fragmented literature reviews, nebulous methodological taxonomies, inadequate evaluative measures, and a dearth of accessible, user-oriented tools. In light of these challenges, this study embarks on an exhaustive dissection of DA methodologies within the TSC realm. Our initial approach involved an extensive literature review spanning a decade, revealing that contemporary surveys scarcely capture the breadth of advancements in DA for TSC, prompting us to meticulously analyze over 100 scholarly articles to distill more than 60 unique DA techniques. This rigorous analysis precipitated the formulation of a novel taxonomy, purpose-built for the intricacies of DA in TSC, categorizing techniques into five principal echelons: Transformation-Based, Pattern-Based, Generative, Decomposition-Based, and Automated Data Augmentation. Our taxonomy promises to serve as a robust navigational aid for scholars, offering clarity and direction in method selection. Addressing the conspicuous absence of holistic evaluations for prevalent DA techniques, we executed an all-encompassing empirical assessment, wherein upwards of 15 DA strategies were subjected to scrutiny across 8 UCR time-series datasets, employing ResNet and a multi-faceted evaluation paradigm encompassing Accuracy, Method Ranking, and Residual Analysis, yielding a benchmark accuracy of 88.94 ± 11.83%. Our investigation underscored the inconsistent efficacies of DA techniques, with...
    Can Brain Signals Reveal Inner Alignment with Human Languages? (arXiv:2208.06348v4 [q-bio.NC] UPDATED)
    Brain signals, such as Electroencephalography (EEG), and human languages have each been widely explored independently for many downstream tasks; however, the connection between them has not been well explored. In this study, we explore the relationship and dependency between EEG and language. To study this at the representation level, we introduce MTAM, a Multimodal Transformer Alignment Model, to observe coordinated representations between the two modalities. We use various relationship alignment-seeking techniques, such as Canonical Correlation Analysis and Wasserstein Distance, as loss functions to transfigure features. On the downstream applications of sentiment analysis and relation detection, we achieve new state-of-the-art results on two datasets, ZuCo and K-EmoCon. Our method achieves an F1-score improvement of 1.7% on K-EmoCon and 9.3% on ZuCo for sentiment analysis, and 7.4% on ZuCo for relation detection. In addition, we provide interpretations of the performance improvement: (1) feature distribution shows the effectiveness of the alignment module for discovering and encoding the relationship between EEG and language; (2) alignment weights show the influence of different language semantics as well as EEG frequency features; (3) brain topographical maps provide an intuitive demonstration of the connectivity in the brain regions. Our code is available at https://github.com/Jason-Qiu/EEG_Language_Alignment.
    DCSI -- An improved measure of cluster separability based on separation and connectedness. (arXiv:2310.12806v1 [stat.ML])
    Whether class labels in a given data set correspond to meaningful clusters is crucial for the evaluation of clustering algorithms using real-world data sets. This property can be quantified by separability measures. A review of the existing literature shows that neither classification-based complexity measures nor cluster validity indices (CVIs) adequately incorporate the central aspects of separability for density-based clustering: between-class separation and within-class connectedness. A newly developed measure (density cluster separability index, DCSI) aims to quantify these two characteristics and can also be used as a CVI. Extensive experiments on synthetic data indicate that DCSI correlates strongly with the performance of DBSCAN measured via the adjusted Rand index (ARI) but lacks robustness when it comes to multi-class data sets with overlapping classes that are ill-suited for density-based hard clustering. Detailed evaluation on frequently used real-world data sets shows that DCSI can correctly identify touching or overlapping classes that do not form meaningful clusters.
    Language-Guided Traffic Simulation via Scene-Level Diffusion. (arXiv:2306.06344v2 [cs.RO] UPDATED)
    Realistic and controllable traffic simulation is a core capability that is necessary to accelerate autonomous vehicle (AV) development. However, current approaches for controlling learning-based traffic models require significant domain expertise and are difficult for practitioners to use. To remedy this, we present CTG++, a scene-level conditional diffusion model that can be guided by language instructions. Developing this requires tackling two challenges: the need for a realistic and controllable traffic model backbone, and an effective method to interface with a traffic model using language. To address these challenges, we first propose a scene-level diffusion model equipped with a spatio-temporal transformer backbone, which generates realistic and controllable traffic. We then harness a large language model (LLM) to convert a user's query into a loss function, guiding the diffusion model towards query-compliant generation. Through comprehensive evaluation, we demonstrate the effectiveness of our proposed method in generating realistic, query-compliant traffic simulations.
    Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale. (arXiv:2306.15687v2 [eess.AS] UPDATED)
    Large-scale generative models such as GPT and DALL-E have revolutionized the research community. These models not only generate high fidelity outputs, but are also generalists which can solve tasks not explicitly taught. In contrast, speech generative models are still primitive in terms of scale and task generalization. In this paper, we present Voicebox, the most versatile text-guided generative model for speech at scale. Voicebox is a non-autoregressive flow-matching model trained to infill speech, given audio context and text, trained on over 50K hours of speech that are not filtered or enhanced. Similar to GPT, Voicebox can perform many different tasks through in-context learning, but is more flexible as it can also condition on future context. Voicebox can be used for mono or cross-lingual zero-shot text-to-speech synthesis, noise removal, content editing, style conversion, and diverse sample generation. In particular, Voicebox outperforms the state-of-the-art zero-shot TTS model VALL-E on both intelligibility (5.9% vs 1.9% word error rates) and audio similarity (0.580 vs 0.681) while being up to 20 times faster. Audio samples can be found at https://voicebox.metademolab.com.
    ROMO: Retrieval-enhanced Offline Model-based Optimization. (arXiv:2310.07560v2 [cs.LG] UPDATED)
    Data-driven black-box model-based optimization (MBO) problems arise in a great number of practical application scenarios, where the goal is to find a design over the whole space maximizing a black-box target function based on a static offline dataset. In this work, we consider a more general but challenging MBO setting, named constrained MBO (CoMBO), where only part of the design space can be optimized while the rest is constrained by the environment. A new challenge arising from CoMBO is that most observed designs that satisfy the constraints are mediocre in evaluation. Therefore, we focus on optimizing these mediocre designs in the offline dataset while maintaining the given constraints rather than further boosting the best observed design in the traditional MBO setting. We propose retrieval-enhanced offline model-based optimization (ROMO), a new derivable forward approach that retrieves the offline dataset and aggregates relevant samples to provide a trusted prediction, and uses it for gradient-based optimization. ROMO is simple to implement and outperforms state-of-the-art approaches in the CoMBO setting. Empirically, we conduct experiments on a synthetic Hartmann (3D) function dataset, an industrial CIO dataset, and a suite of modified tasks in the Design-Bench benchmark. Results show that ROMO performs well in a wide range of constrained optimization tasks.
    INSTA-BNN: Binary Neural Network with INSTAnce-aware Threshold. (arXiv:2204.07439v3 [cs.CV] UPDATED)
    Binary Neural Networks (BNNs) have emerged as a promising solution for reducing the memory footprint and compute costs of deep neural networks, but they suffer from quality degradation due to the lack of freedom as activations and weights are constrained to the binary values. To compensate for the accuracy drop, we propose a novel BNN design called Binary Neural Network with INSTAnce-aware threshold (INSTA-BNN), which controls the quantization threshold dynamically in an input-dependent or instance-aware manner. According to our observation, higher-order statistics can be a representative metric to estimate the characteristics of the input distribution. INSTA-BNN is designed to adjust the threshold dynamically considering various information, including higher-order statistics, but it is also optimized judiciously to realize minimal overhead on a real device. Our extensive study shows that INSTA-BNN outperforms the baseline by 3.0% and 2.8% on the ImageNet classification task with comparable computing cost, achieving 68.5% and 72.2% top-1 accuracy on ResNet-18 and MobileNetV1 based models, respectively.
    Discretize Relaxed Solution of Spectral Clustering via a Non-Heuristic Algorithm. (arXiv:2310.12752v1 [cs.LG])
    Spectral clustering and its extensions usually consist of two steps: (1) constructing a graph and computing the relaxed solution; (2) discretizing relaxed solutions. Although the former has been extensively investigated, the discretization techniques are mainly heuristic methods, e.g., k-means, spectral rotation. Unfortunately, the goal of the existing methods is not to find a discrete solution that minimizes the original objective. In other words, the primary drawback is the neglect of the original objective when computing the discrete solution. Inspired by first-order optimization algorithms, we propose to develop a first-order term to bridge the original problem and the discretization algorithm, which is, to the best of our knowledge, the first non-heuristic method. Since the non-heuristic method is aware of the original graph cut problem, the final discrete solution is more reliable and achieves a preferable loss value. We also theoretically show that the continuous optimum is beneficial to discretization algorithms, even though simply finding its closest discrete solution is an existing heuristic algorithm that is also unreliable. Extensive experiments show the superiority of our method.
    Fairness in Streaming Submodular Maximization over a Matroid Constraint. (arXiv:2305.15118v2 [cs.LG] UPDATED)
    Streaming submodular maximization is a natural model for the task of selecting a representative subset from a large-scale dataset. If datapoints have sensitive attributes such as gender or race, it becomes important to enforce fairness to avoid bias and discrimination. This has spurred significant interest in developing fair machine learning algorithms. Recently, such algorithms have been developed for monotone submodular maximization under a cardinality constraint. In this paper, we study the natural generalization of this problem to a matroid constraint. We give streaming algorithms as well as impossibility results that provide trade-offs between efficiency, quality and fairness. We validate our findings empirically on a range of well-known real-world applications: exemplar-based clustering, movie recommendation, and maximum coverage in social networks.
    PEFT-Ref: A Modular Reference Architecture and Typology for Parameter-Efficient Finetuning Techniques. (arXiv:2304.12410v2 [cs.CL] UPDATED)
    Recent parameter-efficient finetuning (PEFT) techniques aim to improve over the considerable cost of fully finetuning large pretrained language models (PLM). As different PEFT techniques proliferate, it is becoming difficult to compare them, in particular in terms of (i) the structure and functionality they add to the PLM, (ii) the different types and degrees of efficiency improvements achieved, (iii) performance at different downstream tasks, and (iv) how differences in structure and functionality relate to efficiency and task performance. To facilitate such comparisons, this paper presents a reference architecture which standardises aspects shared by different PEFT techniques, while isolating differences to specific locations and interactions with the standard components. Through this process of standardising and isolating differences, a modular view of PEFT techniques emerges, supporting not only direct comparison of different techniques and their efficiency and task performance, but also systematic exploration of reusability and composability of the different types of finetuned modules. We demonstrate how the reference architecture can be applied to understand properties and relative advantages of PEFT techniques, hence to inform selection of techniques for specific tasks, and design choices for new PEFT techniques.
    Hybrid Search for Efficient Planning with Completeness Guarantees. (arXiv:2310.12819v1 [cs.AI])
    Solving complex planning problems has been a long-standing challenge in computer science. Learning-based subgoal search methods have shown promise in tackling these problems, but they often suffer from a lack of completeness guarantees, meaning that they may fail to find a solution even if one exists. In this paper, we propose an efficient approach to augment a subgoal search method to achieve completeness in discrete action spaces. Specifically, we augment the high-level search with low-level actions to execute a multi-level (hybrid) search, which we call complete subgoal search. This solution achieves the best of both worlds: the practical efficiency of high-level search and the completeness of low-level search. We apply the proposed search method to a recently proposed subgoal search algorithm and evaluate the algorithm trained on offline data on complex planning problems. We demonstrate that our complete subgoal search not only guarantees completeness but can even improve performance in terms of search expansions for instances that the high-level search could solve without low-level augmentations. Our approach makes it possible to apply subgoal-level planning for systems where completeness is a critical requirement.
    When Rigidity Hurts: Soft Consistency Regularization for Probabilistic Hierarchical Time Series Forecasting. (arXiv:2206.07940v4 [cs.LG] UPDATED)
    Probabilistic hierarchical time-series forecasting is an important variant of time-series forecasting, where the goal is to model and forecast multivariate time-series that have underlying hierarchical relations. Most methods focus on point predictions and do not provide well-calibrated probabilistic forecast distributions. Recent state-of-the-art probabilistic forecasting methods also impose hierarchical relations on point predictions and samples of the distribution, which does not account for coherency of forecast distributions. Previous works also silently assume that datasets are always consistent with the given hierarchical relations and do not adapt to real-world datasets that deviate from this assumption. We close both these gaps and propose PROFHiT, a fully probabilistic hierarchical forecasting model that jointly models the forecast distribution of the entire hierarchy. PROFHiT uses a flexible probabilistic Bayesian approach and introduces a novel Distributional Coherency regularization to learn from hierarchical relations for the entire forecast distribution, which enables robust and calibrated forecasts and adapts to datasets of varying hierarchical consistency. On evaluating PROFHiT over a wide range of datasets, we observed 41-88% better performance in accuracy and significantly better calibration. Because it models coherency over the full distribution, PROFHiT can robustly provide reliable forecasts even if up to 10% of input time-series data is missing, where other methods' performance severely degrades by over 70%.
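    The hierarchical relation discussed above can be made concrete for point forecasts: a forecast is coherent when each parent series equals the sum of its children. The sketch below only measures that gap for a hypothetical two-level hierarchy; it is not PROFHiT's Distributional Coherency regularization, which acts on whole forecast distributions, and all names and numbers are assumptions.

```python
def coherency_gap(forecasts, children):
    """Absolute gap between each parent's forecast and the sum of its children's."""
    return {
        parent: abs(forecasts[parent] - sum(forecasts[c] for c in kids))
        for parent, kids in children.items()
    }


# Hypothetical hierarchy: one aggregate series with two child series.
children = {"total": ["north", "south"]}
coherent = {"total": 30.0, "north": 12.0, "south": 18.0}
deviating = {"total": 33.0, "north": 12.0, "south": 18.0}

gap_coherent = coherency_gap(coherent, children)
gap_deviating = coherency_gap(deviating, children)
```

    A hard constraint would force this gap to zero, whereas a soft regularizer would add a penalty that grows with the gap, leaving room for datasets that do not follow the hierarchy exactly.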
    Survival of the Most Influential Prompts: Efficient Black-Box Prompt Search via Clustering and Pruning. (arXiv:2310.12774v1 [cs.CL])
    Prompt-based learning has been an effective paradigm for large pretrained language models (LLMs), enabling few-shot or even zero-shot learning. Black-box prompt search has received growing interest recently for its distinctive properties of gradient-free optimization, proven particularly useful and powerful for model-as-a-service usage. However, the discrete nature and the complexity of combinatorial optimization hinder the efficiency of modern black-box approaches. Despite extensive research on search algorithms, the crucial aspect of search space design and optimization has been largely overlooked. In this paper, we first conduct a sensitivity analysis by prompting LLMs, revealing that only a small number of tokens exert a disproportionate amount of influence on LLM predictions. Leveraging this insight, we propose the Clustering and Pruning for Efficient Black-box Prompt Search (ClaPS), a simple black-box search method that first clusters and prunes the search space to focus exclusively on influential prompt tokens. By employing even simple search methods within the pruned search space, ClaPS achieves state-of-the-art performance across various tasks and LLMs, surpassing the performance of complex approaches while significantly reducing search costs. Our findings underscore the critical role of search space design and optimization in enhancing both the usefulness and the efficiency of black-box prompt-based learning.
    Tracking electricity losses and their perceived causes using nighttime light and social media. (arXiv:2310.12346v1 [physics.soc-ph])
    Urban environments are intricate systems where the breakdown of critical infrastructure can impact both the economic and social well-being of communities. Electricity systems hold particular significance, as they are essential for other infrastructure, and disruptions can trigger widespread consequences. Typically, assessing electricity availability requires ground-level data, a challenge in conflict zones and regions with limited access. This study shows how satellite imagery, social media, and information extraction can monitor blackouts and their perceived causes. Nighttime light data (in March 2019 for Caracas, Venezuela) is used to indicate blackout regions. Twitter data is used to determine sentiment and topic trends, while statistical analysis and topic modeling delve into public perceptions regarding blackout causes. The findings show an inverse relationship between nighttime light intensity. Tweets mentioning the Venezuelan President displayed heightened negativity and a greater prevalence of blame-related terms, suggesting a perception of government accountability for the outages.
    Causal-structure Driven Augmentations for Text OOD Generalization. (arXiv:2310.12803v1 [cs.LG])
    The reliance of text classifiers on spurious correlations can lead to poor generalization at deployment, raising concerns about their use in safety-critical domains such as healthcare. In this work, we propose to use counterfactual data augmentation, guided by knowledge of the causal structure of the data, to simulate interventions on spurious features and to learn more robust text classifiers. We show that this strategy is appropriate in prediction problems where the label is spuriously correlated with an attribute. Under the assumptions of such problems, we discuss the favorable sample complexity of counterfactual data augmentation, compared to importance re-weighting. Pragmatically, we match examples using auxiliary data, based on diff-in-diff methodology, and use a large language model (LLM) to represent a conditional probability of text. Through extensive experimentation on learning caregiver-invariant predictors of clinical diagnoses from medical narratives and on semi-synthetic data, we demonstrate that our method for simulating interventions improves out-of-distribution (OOD) accuracy compared to baseline invariant learning algorithms.
    Audio Editing with Non-Rigid Text Prompts. (arXiv:2310.12858v1 [cs.SD])
    In this paper, we explore audio-editing with non-rigid text edits. We show that the proposed editing pipeline is able to create audio edits that remain faithful to the input audio. We explore text prompts that perform addition, style transfer, and in-painting. We quantitatively and qualitatively show that the edits are able to obtain results which outperform Audio-LDM, a recently released text-prompted audio generation model. Qualitative inspection of the results points out that the edits given by our approach remain more faithful to the input audio in terms of keeping the original onsets and offsets of the audio events.
    Neural Relation Graph: A Unified Framework for Identifying Label Noise and Outlier Data. (arXiv:2301.12321v4 [cs.LG] UPDATED)
    Diagnosing and cleaning data is a crucial step for building robust machine learning systems. However, identifying problems within large-scale datasets with real-world distributions is challenging due to the presence of complex issues such as label errors, under-representation, and outliers. In this paper, we propose a unified approach for identifying problematic data by utilizing a largely ignored source of information: the relational structure of data in the feature-embedded space. To this end, we present scalable and effective algorithms for detecting label errors and outlier data based on the relational graph structure of data. We further introduce a visualization tool that provides contextual information about a data point in the feature-embedded space, serving as an effective tool for interactively diagnosing data. We evaluate the label error and outlier/out-of-distribution (OOD) detection performance of our approach on large-scale image, speech, and language domain tasks, including ImageNet, ESC-50, and SST2. Our approach achieves state-of-the-art detection performance on all tasks considered and demonstrates its effectiveness in debugging large-scale real-world datasets across various domains. We release code at https://github.com/snu-mllab/Neural-Relation-Graph.
    Explanation-Based Training with Differentiable Insertion/Deletion Metric-Aware Regularizers. (arXiv:2310.12553v1 [cs.LG])
    The quality of explanations for the predictions of complex machine learning predictors is often measured using insertion and deletion metrics, which assess the faithfulness of the explanations, i.e., how correctly the explanations reflect the predictor's behavior. To improve faithfulness, we propose insertion/deletion metric-aware explanation-based optimization (ID-ExpO), which optimizes differentiable predictors to improve both the insertion and deletion scores of the explanations while keeping their predictive accuracy. Since the original insertion and deletion metrics are non-differentiable with respect to the explanations and cannot be used directly for gradient-based optimization, we extend the metrics to be differentiable and use them to formalize insertion and deletion metric-based regularizers. The experimental results on image and tabular datasets show that deep neural network-based predictors fine-tuned using ID-ExpO enable popular post-hoc explainers to produce more faithful and easy-to-interpret explanations while keeping high predictive accuracy.
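As an illustration of the kind of metric the abstract refers to, the following is a minimal sketch of a standard (non-differentiable) deletion score, not of ID-ExpO itself. The function names, the zero baseline, and the toy linear predictor are assumptions made for this example only.

```python
import numpy as np

def deletion_scores(predict, x, attribution, baseline=0.0):
    """Model outputs as features are removed from most to least important."""
    order = np.argsort(-attribution)      # most important feature first
    current = x.astype(float).copy()
    scores = [predict(current)]           # output before any deletion
    for idx in order:
        current[idx] = baseline           # "delete" the feature by resetting it
        scores.append(predict(current))
    return np.array(scores)

def deletion_auc(scores):
    """Trapezoidal area under the deletion curve, x-axis normalized to [0, 1]."""
    steps = len(scores) - 1
    return float(np.sum((scores[:-1] + scores[1:]) / 2.0) / steps)

# Hypothetical toy setup: a linear predictor explained by weight * input.
weights = np.array([3.0, 1.0, 2.0])
x = np.array([1.0, 1.0, 1.0])
predict = lambda v: float(weights @ v)

scores = deletion_scores(predict, x, weights * x)
auc = deletion_auc(scores)
```

A lower deletion area suggests that the explanation ranks first the features the predictor relies on most. The sorting step is one reason a score of this kind cannot be used directly as a gradient-based training objective.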
    Example-based Hypernetworks for Out-of-Distribution Generalization. (arXiv:2203.14276v3 [cs.CL] UPDATED)
    As Natural Language Processing (NLP) algorithms continually achieve new milestones, out-of-distribution generalization remains a significant challenge. This paper addresses the issue of multi-source adaptation for unfamiliar domains: we leverage labeled data from multiple source domains to generalize to target domains that are unknown at training time. Our framework employs example-based Hypernetwork adaptation: a T5 encoder-decoder initially generates a unique signature from an input example, embedding it within the source domains' semantic space. This signature is subsequently utilized by a Hypernetwork to generate the task classifier's weights. We evaluated our method across two tasks - sentiment classification and natural language inference - in 29 adaptation scenarios, where it outperformed established algorithms. In an advanced version, the signature also enriches the input example's representation. We also compare our finetuned architecture to few-shot GPT-3, demonstrating its effectiveness in essential use cases. To our knowledge, this marks the first application of Hypernetworks to adaptation for unknown domains.
    KwaiYiiMath: Technical Report. (arXiv:2310.07488v2 [cs.CL] UPDATED)
    Recent advancements in large language models (LLMs) have demonstrated remarkable abilities in handling a variety of natural language processing (NLP) downstream tasks, even mathematical tasks requiring multi-step reasoning. In this report, we introduce KwaiYiiMath, which enhances the mathematical reasoning abilities of KwaiYiiBase1 by applying Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF) on both English and Chinese mathematical tasks. We also constructed a small-scale Chinese primary school mathematics test set (named KMath), consisting of 188 examples, to evaluate the correctness of the problem-solving process generated by the models. Empirical studies demonstrate that KwaiYiiMath achieves state-of-the-art (SOTA) performance on GSM8k, CMath, and KMath compared with models of similar size.
    Convergence of policy gradient methods for finite-horizon stochastic linear-quadratic control problems. (arXiv:2211.00617v2 [math.OC] UPDATED)
    We study the global linear convergence of policy gradient (PG) methods for finite-horizon continuous-time exploratory linear-quadratic control (LQC) problems. The setting includes stochastic LQC problems with indefinite costs and allows additional entropy regularisers in the objective. We consider a continuous-time Gaussian policy whose mean is linear in the state variable and whose covariance is state-independent. Contrary to discrete-time problems, the cost is noncoercive in the policy and not all descent directions lead to bounded iterates. We propose geometry-aware gradient descents for the mean and covariance of the policy using the Fisher geometry and the Bures-Wasserstein geometry, respectively. The policy iterates are shown to satisfy an a-priori bound, and converge globally to the optimal policy with a linear rate. We further propose a novel PG method with discrete-time policies. The algorithm leverages the continuous-time analysis, and achieves a robust linear convergence across different action frequencies. A numerical experiment confirms the convergence and robustness of the proposed algorithm.
    IC3: Image Captioning by Committee Consensus. (arXiv:2302.01328v3 [cs.CV] UPDATED)
    If you ask a human to describe an image, they might do so in a thousand different ways. Traditionally, image captioning models are trained to generate a single "best" (most like a reference) image caption. Unfortunately, doing so encourages captions that are "informationally impoverished," and focus on only a subset of the possible details, while ignoring other potentially useful information in the scene. In this work, we introduce a simple, yet novel, method: "Image Captioning by Committee Consensus" (IC3), designed to generate a single caption that captures high-level details from several annotator viewpoints. Humans rate captions produced by IC3 at least as helpful as baseline SOTA models more than two thirds of the time, and IC3 can improve the performance of SOTA automated recall systems by up to 84%, outperforming single human-generated reference captions, and indicating significant improvements over SOTA approaches for visual description. Code is available at https://davidmchan.github.io/caption-by-committee/
    An Information-Theoretic Analysis of Compute-Optimal Neural Scaling Laws. (arXiv:2212.01365v2 [cs.LG] UPDATED)
    We study the compute-optimal trade-off between model and training data set sizes for large neural networks. Our result suggests a linear relation similar to that supported by the empirical analysis of Chinchilla. While that work studies transformer-based large language models trained on the MassiveText corpus (Gopher), as a starting point for development of a mathematical theory, we focus on a simpler learning model and data generating process, each based on a neural network with a sigmoidal output unit and a single hidden layer of ReLU activation units. We introduce general error upper bounds for a class of algorithms which incrementally update a statistic (for example, gradient descent). For a particular learning model inspired by Barron (1993), we establish an upper bound on the minimal information-theoretically achievable expected error as a function of model and data set sizes. We then derive allocations of computation that minimize this bound. We present empirical results which suggest that this approximation correctly identifies an asymptotic linear compute-optimal scaling. This approximation also generates new insights. Among other things, it suggests that, as the input dimension or latent space complexity grows, as might be the case if, for example, a longer history of tokens is taken as input to a language model, a larger fraction of the compute budget should be allocated to growing the learning model rather than the training data.
    Speaking Style Conversion in the Waveform Domain Using Discrete Self-Supervised Units. (arXiv:2212.09730v2 [cs.SD] UPDATED)
    We introduce DISSC, a novel, lightweight method that converts the rhythm, pitch contour and timbre of a recording to a target speaker in a textless manner. Unlike DISSC, most voice conversion (VC) methods focus primarily on timbre, and ignore people's unique speaking style (prosody). The proposed approach uses a pretrained, self-supervised model for encoding speech to discrete units, which makes it simple, effective, and fast to train. All conversion modules are trained only on reconstruction-like tasks, and are thus suitable for any-to-many VC with no paired data. We introduce a suite of quantitative and qualitative evaluation metrics for this setup, and empirically demonstrate that DISSC significantly outperforms the evaluated baselines. Code and samples are available at https://pages.cs.huji.ac.il/adiyoss-lab/dissc/.
    Named Entity Recognition for Monitoring Plant Health Threats in Tweets: a ChouBERT Approach. (arXiv:2310.12522v1 [cs.CL])
    An important application scenario of precision agriculture is detecting and measuring crop health threats using sensors and data analysis techniques. However, the textual data are still under-explored among the existing solutions due to the lack of labelled data and fine-grained semantic resources. Recent research suggests that the increasing connectivity of farmers and the emergence of online farming communities make social media like Twitter a participatory platform for detecting unfamiliar plant health events if we can extract essential information from unstructured textual data. ChouBERT is a French pre-trained language model that can identify Tweets concerning observations of plant health issues with generalizability on unseen natural hazards. This paper tackles the lack of labelled data by further studying ChouBERT's know-how on token-level annotation tasks over small labeled sets.
    Reinforcement Learning and Bandits for Speech and Language Processing: Tutorial, Review and Outlook. (arXiv:2210.13623v3 [cs.AI] UPDATED)
    In recent years, reinforcement learning and bandits have transformed a wide range of real-world applications including healthcare, finance, recommendation systems, robotics, and last but not least, speech and natural language processing. While most speech and language applications of reinforcement learning algorithms are centered around improving the training of deep neural networks with its flexible optimization properties, there is still much ground to explore in utilizing the benefits of reinforcement learning, such as its reward-driven adaptability, state representations, temporal structures and generalizability. In this survey, we present an overview of recent advancements in reinforcement learning and bandits, and discuss how they can be effectively employed to solve speech and natural language processing problems with models that are adaptive, interactive and scalable.
    Deep Discriminative to Kernel Density Networks for Calibrated Inference. (arXiv:2201.13001v6 [cs.LG] UPDATED)
    Deep discriminative approaches like random forests and deep neural networks have recently found applications in many important real-world scenarios. However, deploying these learning algorithms in safety-critical applications raises concerns, particularly when it comes to ensuring confidence calibration for both in-distribution and out-of-distribution data points. Many popular methods for in-distribution (ID) calibration, such as isotonic regression and Platt's sigmoidal regression, exhibit excellent ID calibration performance but often at the cost of classification accuracy. Moreover, these methods are not calibrated for the entire feature space, leading to overconfidence in the case of out-of-distribution (OOD) samples. In this paper, we leverage the fact that deep models, including both random forests and deep-nets, learn internal representations which are unions of polytopes with affine activation functions to conceptualize them both as partitioning rules of the feature space. We replace the affine function in each polytope populated by the training data with a Gaussian kernel. We propose sufficient conditions for our proposed methods to be consistent estimators of the corresponding class conditional densities. Moreover, our experiments on both tabular and vision benchmarks show that the proposed approaches obtain well-calibrated posteriors while mostly preserving or improving the classification accuracy of the original algorithm for the in-distribution region, and extrapolate beyond the training data to handle out-of-distribution inputs appropriately.
    Multi-label Node Classification On Graph-Structured Data. (arXiv:2304.10398v3 [cs.LG] UPDATED)
    Graph Neural Networks (GNNs) have shown state-of-the-art improvements in node classification tasks on graphs. While these improvements have been largely demonstrated in a multi-class classification scenario, a more general and realistic scenario in which each node could have multiple labels has so far received little attention. The first challenge in conducting focused studies on multi-label node classification is the limited number of publicly available multi-label graph datasets. Therefore, as our first contribution, we collect and release three real-world biological datasets and develop a multi-label graph generator to generate datasets with tunable properties. While the success of GNNs is usually attributed to high label similarity (high homophily), we argue that a multi-label scenario does not follow the usual semantics of homophily and heterophily so far defined for a multi-class scenario. As our second contribution, we define homophily and Cross-Class Neighborhood Similarity for the multi-label scenario and provide a thorough analysis of the collected $9$ multi-label datasets. Finally, we perform a large-scale comparative study with $8$ methods and $9$ datasets and analyse the performance of the methods to assess the progress made by the current state of the art in the multi-label node classification scenario. We release our benchmark at https://github.com/Tianqi-py/MLGNC.
    A Quasi-Wasserstein Loss for Learning Graph Neural Networks. (arXiv:2310.11762v2 [cs.LG] UPDATED)
    When learning graph neural networks (GNNs) in node-level prediction tasks, most existing loss functions are applied for each node independently, even if node embeddings and their labels are non-i.i.d. because of their graph structures. To eliminate such inconsistency, in this study we propose a novel Quasi-Wasserstein (QW) loss with the help of the optimal transport defined on graphs, leading to new learning and prediction paradigms of GNNs. In particular, we design a "Quasi-Wasserstein" distance between the observed multi-dimensional node labels and their estimations, optimizing the label transport defined on graph edges. The estimations are parameterized by a GNN in which the optimal label transport may determine the graph edge weights optionally. By reformulating the strict constraint of the label transport to a Bregman divergence-based regularizer, we obtain the proposed Quasi-Wasserstein loss associated with two efficient solvers learning the GNN together with optimal label transport. When predicting node labels, our model combines the output of the GNN with the residual component provided by the optimal label transport, leading to a new transductive prediction paradigm. Experiments show that the proposed QW loss applies to various GNNs and helps to improve their performance in node-level classification and regression tasks.
    Blind quantum machine learning with quantum bipartite correlator. (arXiv:2310.12893v1 [quant-ph])
    Distributed quantum computing is a promising computational paradigm for performing computations that are beyond the reach of individual quantum devices. Privacy in distributed quantum computing is critical for maintaining confidentiality and protecting the data in the presence of untrusted computing nodes. In this work, we introduce novel blind quantum machine learning protocols based on the quantum bipartite correlator algorithm. Our protocols have reduced communication overhead while preserving the privacy of data from untrusted parties. We introduce robust algorithm-specific privacy-preserving mechanisms with low computational overhead that do not require complex cryptographic techniques. We then validate the effectiveness of the proposed protocols through complexity and privacy analysis. Our findings pave the way for advancements in distributed quantum computing, opening up new possibilities for privacy-aware machine learning applications in the era of quantum technologies.
    Prompt Injection Attacks and Defenses in LLM-Integrated Applications. (arXiv:2310.12815v1 [cs.CR])
    Large Language Models (LLMs) are increasingly deployed as the backend for a variety of real-world applications called LLM-Integrated Applications. Multiple recent works showed that LLM-Integrated Applications are vulnerable to prompt injection attacks, in which an attacker injects malicious instruction/data into the input of those applications such that they produce results as the attacker desires. However, existing works are limited to case studies. As a result, the literature lacks a systematic understanding of prompt injection attacks and their defenses. We aim to bridge the gap in this work. In particular, we propose a general framework to formalize prompt injection attacks. Existing attacks, which are discussed in research papers and blog posts, are special cases in our framework. Our framework enables us to design a new attack by combining existing attacks. Moreover, we also propose a framework to systematize defenses against prompt injection attacks. Using our frameworks, we conduct a systematic evaluation on prompt injection attacks and their defenses with 10 LLMs and 7 tasks. We hope our frameworks can inspire future research in this field. Our code is available at https://github.com/liu00222/Open-Prompt-Injection.
    Fine-Tuning Generative Models as an Inference Method for Robotic Tasks. (arXiv:2310.12862v1 [cs.LG])
    Adaptable models could greatly benefit robotic agents operating in the real world, allowing them to deal with novel and varying conditions. While approaches such as Bayesian inference are well-studied frameworks for adapting models to evidence, we build on recent advances in deep generative models which have greatly affected many areas of robotics. Harnessing modern GPU acceleration, we investigate how to quickly adapt the sample generation of neural network models to observations in robotic tasks. We propose a simple and general method that is applicable to various deep generative models and robotic environments. The key idea is to quickly fine-tune the model by fitting it to generated samples matching the observed evidence, using the cross-entropy method. We show that our method can be applied to both autoregressive models and variational autoencoders, and demonstrate its usability in object shape inference from grasping, inverse kinematics calculation, and point cloud completion.
    Communication-Efficient On-Device Machine Learning: Federated Distillation and Augmentation under Non-IID Private Data. (arXiv:1811.11479v2 [cs.LG] UPDATED)
    On-device machine learning (ML) enables the training process to exploit a massive amount of user-generated private data samples. To enjoy this benefit, inter-device communication overhead should be minimized. To this end, we propose federated distillation (FD), a distributed model training algorithm whose communication payload size is much smaller than that of a benchmark scheme, federated learning (FL), particularly when the model size is large. Moreover, user-generated data samples are likely to become non-IID across devices, which commonly degrades the performance compared to the case with an IID dataset. To cope with this, we propose federated augmentation (FAug), where each device collectively trains a generative model, and thereby augments its local data towards yielding an IID dataset. Empirical studies demonstrate that FD with FAug yields around 26x less communication overhead while achieving 95-98% test accuracy compared to FL.
    Model-agnostic variable importance for predictive uncertainty: an entropy-based approach. (arXiv:2310.12842v1 [stat.ML])
    In order to trust the predictions of a machine learning algorithm, it is necessary to understand the factors that contribute to those predictions. In the case of probabilistic and uncertainty-aware models, it is necessary to understand not only the reasons for the predictions themselves, but also the model's level of confidence in those predictions. In this paper, we show how existing methods in explainability can be extended to uncertainty-aware models and how such extensions can be used to understand the sources of uncertainty in a model's predictive distribution. In particular, by adapting permutation feature importance, partial dependence plots, and individual conditional expectation plots, we demonstrate that novel insights into model behaviour may be obtained and that these methods can be used to measure the impact of features on both the entropy of the predictive distribution and the log-likelihood of the ground truth labels under that distribution. With experiments using both synthetic and real-world data, we demonstrate the utility of these approaches in understanding both the sources of uncertainty and their impact on model performance.
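One way to picture the adaptation of permutation feature importance to predictive entropy is the sketch below. It is an assumption-laden illustration rather than the authors' implementation: the function names, the number of repeats, and the toy classifier `toy_predict_proba` are all hypothetical.

```python
import numpy as np

def predictive_entropy(probs):
    """Shannon entropy (in nats) of each row of class probabilities."""
    p = np.clip(probs, 1e-12, 1.0)        # avoid log(0)
    return -np.sum(p * np.log(p), axis=1)

def entropy_permutation_importance(predict_proba, X, n_repeats=5, seed=0):
    """Mean change in predictive entropy when one feature column is shuffled."""
    rng = np.random.default_rng(seed)
    base = predictive_entropy(predict_proba(X)).mean()
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        deltas = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])   # break feature j's link to the rest
            deltas.append(predictive_entropy(predict_proba(X_perm)).mean() - base)
        importances[j] = np.mean(deltas)
    return importances

def toy_predict_proba(X):
    """Hypothetical binary classifier that uses features 0 and 1 and ignores feature 2."""
    logit = 3.0 * X[:, 0] * X[:, 1]
    p1 = 1.0 / (1.0 + np.exp(-logit))
    return np.column_stack([1.0 - p1, p1])

X = np.random.default_rng(1).normal(size=(200, 3))
importances = entropy_permutation_importance(toy_predict_proba, X)
```

In this toy setup the ignored feature receives zero importance, since shuffling it leaves every predictive distribution unchanged. A non-zero value for another feature indicates that it affects the model's uncertainty, not necessarily its accuracy.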
    Neural networks with linear threshold activations: structure and algorithms. (arXiv:2111.08117v4 [cs.LG] UPDATED)
    In this article we present new results on neural networks with linear threshold activation functions. We precisely characterize the class of functions that are representable by such neural networks and show that 2 hidden layers are necessary and sufficient to represent any function representable in the class. This is a surprising result in the light of recent exact representability investigations for neural networks using other popular activation functions like rectified linear units (ReLU). We also give precise bounds on the sizes of the neural networks required to represent any function in the class. Finally, we design an algorithm to solve the empirical risk minimization (ERM) problem to global optimality for these neural networks with a fixed architecture. The algorithm's running time is polynomial in the size of the data sample, if the input dimension and the size of the network architecture are considered fixed constants. The algorithm is unique in the sense that it works for any architecture with any number of layers, whereas previous polynomial time globally optimal algorithms work only for very restricted classes of architectures. Using these insights, we propose a new class of neural networks that we call shortcut linear threshold networks. To the best of our knowledge, this way of designing neural networks has not been explored before in the literature. We show that these neural networks have several desirable theoretical properties.
    Generating collective counterfactual explanations in score-based classification via mathematical optimization. (arXiv:2310.12822v1 [stat.ML])
    Due to the increasing use of Machine Learning models in high stakes decision making settings, it has become increasingly important to have tools to understand how models arrive at decisions. Assuming a trained Supervised Classification model, explanations can be obtained via counterfactual analysis: a counterfactual explanation of an instance indicates how this instance should be minimally modified so that the perturbed instance is classified in the desired class by the Machine Learning classification model. Most of the Counterfactual Analysis literature focuses on the single-instance single-counterfactual setting, in which the analysis is done for one single instance to provide one single explanation. Taking a stakeholder's perspective, in this paper we introduce the so-called collective counterfactual explanations. By means of novel Mathematical Optimization models, we provide a counterfactual explanation for each instance in a group of interest, so that the total cost of the perturbations is minimized under some linking constraints. Making the process of constructing counterfactuals collective instead of individual enables us to detect the features that are critical to the entire dataset to have the individuals classified in the desired class. Our methodology allows for some instances to be treated individually, performing the collective counterfactual analysis for a fraction of records of the group of interest. This way, outliers are identified and handled appropriately. Under some assumptions on the classifier and the space in which counterfactuals are sought, finding collective counterfactuals is reduced to solving a convex quadratic linearly constrained mixed integer optimization problem, which, for datasets of moderate size, can be solved to optimality using existing solvers. The performance of our approach is illustrated on real-world datasets, demonstrating its usefulness.
    Hierarchical Forecasting at Scale. (arXiv:2310.12809v1 [cs.LG])
    Existing hierarchical forecasting techniques scale poorly when the number of time series increases. We propose to learn a coherent forecast for millions of time series with a single bottom-level forecast model by using a sparse loss function that directly optimizes the hierarchical product and/or temporal structure. The benefit of our sparse hierarchical loss function is that it provides practitioners a method of producing bottom-level forecasts that are coherent to any chosen cross-sectional or temporal hierarchy. In addition, removing the need for a post-processing step as required in traditional hierarchical forecasting techniques reduces the computational cost of the prediction phase in the forecasting pipeline. On the public M5 dataset, our sparse hierarchical loss function performs up to 10% (RMSE) better compared to the baseline loss function. We implement our sparse hierarchical loss function within an existing forecasting model at bol, a large European e-commerce platform, resulting in an improved forecasting performance of 2% at the product level. Finally, we found an increase in forecasting performance of about 5-10% when evaluating the forecasting performance across the cross-sectional hierarchies that we defined. These results demonstrate the usefulness of our sparse hierarchical loss applied to a production forecasting system at a major e-commerce platform.
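The idea of training a single bottom-level model against a loss evaluated across the whole hierarchy can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's loss: the summing matrix `S`, the squared-error form, and the three-series hierarchy are hypothetical, and a dense NumPy array stands in for what would be a sparse matrix at scale.

```python
import numpy as np

# Hypothetical hierarchy over 3 bottom-level series:
# total = s0 + s1 + s2, group A = s0 + s1, group B = s2.
S = np.array([
    [1, 1, 1],   # total
    [1, 1, 0],   # group A
    [0, 0, 1],   # group B
    [1, 0, 0],   # bottom series s0
    [0, 1, 0],   # bottom series s1
    [0, 0, 1],   # bottom series s2
], dtype=float)

def hierarchical_mse(y_true_bottom, y_pred_bottom, summing_matrix):
    """Mean squared error over all aggregation levels.

    Both inputs have shape (n_bottom_series, n_timesteps). Aggregates are
    derived from the bottom-level values, so forecasts at every level are
    coherent by construction and need no reconciliation step afterwards.
    """
    errors = summing_matrix @ (y_pred_bottom - y_true_bottom)
    return float(np.mean(errors ** 2))

y_true = np.array([[1.0], [2.0], [3.0]])
y_pred = np.array([[2.0], [2.0], [3.0]])   # only series s0 is off, by 1
loss = hierarchical_mse(y_true, y_pred, S)
```

Here the single bottom-level error is counted three times (in s0, group A, and the total), which shows how a loss of this form penalizes mistakes that propagate up the hierarchy.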
    A Theoretical Approach to Characterize the Accuracy-Fairness Trade-off Pareto Frontier. (arXiv:2310.12785v1 [cs.LG])
    While the accuracy-fairness trade-off has been frequently observed in the literature of fair machine learning, rigorous theoretical analyses have been scarce. To demystify this long-standing challenge, this work seeks to develop a theoretical framework by characterizing the shape of the accuracy-fairness trade-off Pareto frontier (FairFrontier), determined by the set of all Pareto-optimal classifiers that no other classifiers can dominate. Specifically, we first demonstrate the existence of the trade-off in real-world scenarios and then propose four potential categories to characterize the important properties of the accuracy-fairness Pareto frontier. For each category, we identify the necessary conditions that lead to corresponding trade-offs. Experimental results on synthetic data suggest insightful findings of the proposed framework: (1) When sensitive attributes can be fully interpreted by non-sensitive attributes, FairFrontier is mostly continuous. (2) Accuracy can suffer a sharp decline when over-pursuing fairness. (3) The trade-off can be eliminated via a two-step streamlined approach. The proposed research enables an in-depth understanding of the accuracy-fairness trade-off, pushing current fair machine-learning research to a new frontier.
    Copiloting the Copilots: Fusing Large Language Models with Completion Engines for Automated Program Repair. (arXiv:2309.00608v2 [cs.SE] UPDATED)
    During Automated Program Repair (APR), it can be challenging to synthesize correct patches for real-world systems in general-purpose programming languages. Recent Large Language Models (LLMs) have been shown to be helpful "copilots" in assisting developers with various coding tasks, and have also been directly applied for patch synthesis. However, most LLMs treat programs as sequences of tokens, meaning that they are ignorant of the underlying semantic constraints of the target programming language. This results in plenty of statically invalid generated patches, impeding the practicality of the technique. Therefore, we propose Repilot, a framework to further copilot the AI "copilots" (i.e., LLMs) by synthesizing more valid patches during the repair process. Our key insight is that many LLMs produce outputs autoregressively (i.e., token by token), resembling how humans write programs, which can be significantly boosted and guided through a Completion Engine. Repilot synergistically synthesizes a candidate patch through the interaction between an LLM and a Completion Engine, which 1) prunes away infeasible tokens suggested by the LLM and 2) proactively completes the token based on the suggestions provided by the Completion Engine. Our evaluation on a subset of the widely-used Defects4j 1.2 and 2.0 datasets shows that Repilot fixes 66 and 50 bugs, respectively, surpassing the best-performing baseline by 14 and 16 bugs fixed. More importantly, Repilot is capable of producing more valid and correct patches than the base LLM when given the same generation budget.
    Bayesian tomography using polynomial chaos expansion and deep generative networks. (arXiv:2307.04228v4 [physics.geo-ph] UPDATED)
    Implementations of Markov chain Monte Carlo (MCMC) methods need to confront two fundamental challenges: accurate representation of prior information and efficient evaluation of likelihoods. Principal component analysis (PCA) and related techniques can in some cases facilitate the definition and sampling of the prior distribution, as well as the training of accurate surrogate models, using for instance, polynomial chaos expansion (PCE). However, complex geological priors with sharp contrasts necessitate more complex dimensionality-reduction techniques, such as, deep generative models (DGMs). By sampling a low-dimensional prior probability distribution defined in the low-dimensional latent space of such a model, it becomes possible to efficiently sample the physical domain at the price of a generator that is typically highly non-linear. Training a surrogate that is capable of capturing intricate non-linear relationships between latent parameters and outputs of forward modeling presents a notable challenge. Indeed, while PCE models provide high accuracy when the input-output relationship can be effectively approximated by relatively low-degree multivariate polynomials, this condition is typically not met when employing latent variables derived from DGMs. In this contribution, we present a strategy combining the excellent reconstruction performances of a variational autoencoder (VAE) with the accuracy of PCA-PCE surrogate modeling in the context of Bayesian ground penetrating radar (GPR) traveltime tomography. Within the MCMC process, the parametrization of the VAE is leveraged for prior exploration and sample proposals. Concurrently, surrogate modeling is conducted using PCE, which operates on either globally or locally defined principal components of the VAE samples under examination.
    Recurrent Neural Language Models as Probabilistic Finite-state Automata. (arXiv:2310.05161v2 [cs.CL] UPDATED)
    Studying language models (LMs) in terms of well-understood formalisms allows us to precisely characterize their abilities and limitations. Previous work has investigated the representational capacity of recurrent neural network (RNN) LMs in terms of their capacity to recognize unweighted formal languages. However, LMs do not describe unweighted formal languages -- rather, they define probability distributions over strings. In this work, we study what classes of such probability distributions RNN LMs can represent, which allows us to make more direct statements about their capabilities. We show that simple RNNs are equivalent to a subclass of probabilistic finite-state automata, and can thus model a strict subset of probability distributions expressible by finite-state models. Furthermore, we study the space complexity of representing finite-state LMs with RNNs. We show that, to represent an arbitrary deterministic finite-state LM with $N$ states over an alphabet $\Sigma$, an RNN requires $\Omega\left(N |\Sigma|\right)$ neurons. These results present a first step towards characterizing the classes of distributions RNN LMs can represent and thus help us understand their capabilities and limitations.
    Neurosymbolic Grounding for Compositional World Models. (arXiv:2310.12690v1 [cs.LG])
    We introduce Cosmos, a framework for object-centric world modeling that is designed for compositional generalization (CG), i.e., high performance on unseen input scenes obtained through the composition of known visual "atoms." The central insight behind Cosmos is the use of a novel form of neurosymbolic grounding. Specifically, the framework introduces two new tools: (i) neurosymbolic scene encodings, which represent each entity in a scene using a real vector computed using a neural encoder, as well as a vector of composable symbols describing attributes of the entity, and (ii) a neurosymbolic attention mechanism that binds these entities to learned rules of interaction. Cosmos is end-to-end differentiable; also, unlike traditional neurosymbolic methods that require representations to be manually mapped to symbols, it computes an entity's symbolic attributes using vision-language foundation models. Through an evaluation that considers two different forms of CG on an established blocks-pushing domain, we show that the framework establishes a new state-of-the-art for CG in world modeling.
    Rank-DETR for High Quality Object Detection. (arXiv:2310.08854v2 [cs.CV] UPDATED)
    Modern detection transformers (DETRs) use a set of object queries to predict a list of bounding boxes, sort them by their classification confidence scores, and select the top-ranked predictions as the final detection results for the given input image. A highly performant object detector requires accurate ranking for the bounding box predictions. For DETR-based detectors, the top-ranked bounding boxes suffer from less accurate localization quality due to the misalignment between classification scores and localization accuracy, thus impeding the construction of high-quality detectors. In this work, we introduce a simple and highly performant DETR-based object detector by proposing a series of rank-oriented designs, collectively called Rank-DETR. Our key contributions include: (i) a rank-oriented architecture design that can prompt positive predictions and suppress the negative ones to ensure lower false positive rates, as well as (ii) a rank-oriented loss function and matching cost design that prioritizes predictions with more accurate localization during ranking to boost the AP under high IoU thresholds. We apply our method to improve the recent SOTA methods (e.g., H-DETR and DINO-DETR) and report strong COCO object detection results when using different backbones such as ResNet-$50$, Swin-T, and Swin-L, demonstrating the effectiveness of our approach. Code is available at https://github.com/LeapLabTHU/Rank-DETR.
    Neural networks for insurance pricing with frequency and severity data: a benchmark study from data preprocessing to technical tariff. (arXiv:2310.12671v1 [cs.LG])
    Insurers usually turn to generalized linear models for modelling claim frequency and severity data. Due to their success in other fields, machine learning techniques are gaining popularity within the actuarial toolbox. Our paper contributes to the literature on frequency-severity insurance pricing with machine learning via deep learning structures. We present a benchmark study on four insurance data sets with frequency and severity targets in the presence of multiple types of input features. We compare in detail the performance of: a generalized linear model on binned input data, a gradient-boosted tree model, a feed-forward neural network (FFNN), and the combined actuarial neural network (CANN). Our CANNs combine a baseline prediction established with a GLM and GBM, respectively, with a neural network correction. We explain the data preprocessing steps with specific focus on the multiple types of input features typically present in tabular insurance data sets, such as postal codes, numeric and categorical covariates. Autoencoders are used to embed the categorical variables into the neural network and we explore their potential advantages in a frequency-severity setting. Finally, we construct global surrogate models for the neural nets' frequency and severity models. These surrogates enable the translation of the essential insights captured by the FFNNs or CANNs to GLMs. As such, a technical tariff table results that can easily be deployed in practice.
    Amazon-M2: A Multilingual Multi-locale Shopping Session Dataset for Recommendation and Text Generation. (arXiv:2307.09688v2 [cs.IR] UPDATED)
    Modeling customer shopping intentions is a crucial task for e-commerce, as it directly impacts user experience and engagement. Thus, accurately understanding customer preferences is essential for providing personalized recommendations. Session-based recommendation, which utilizes customer session data to predict their next interaction, has become increasingly popular. However, existing session datasets have limitations in terms of item attributes, user diversity, and dataset scale. As a result, they cannot comprehensively capture the spectrum of user behaviors and preferences. To bridge this gap, we present the Amazon Multilingual Multi-locale Shopping Session Dataset, namely Amazon-M2. It is the first multilingual dataset consisting of millions of user sessions from six different locales, where the major languages of products are English, German, Japanese, French, Italian, and Spanish. Remarkably, the dataset can help us enhance personalization and understanding of user preferences, which can benefit various existing tasks as well as enable new tasks. To test the potential of the dataset, we introduce three tasks in this work: (1) next-product recommendation, (2) next-product recommendation with domain shifts, and (3) next-product title generation. With the above tasks, we benchmark a range of algorithms on our proposed dataset, drawing new insights for further research and practice. In addition, based on the proposed dataset and tasks, we hosted a competition in the KDD CUP 2023 and have attracted thousands of users and submissions. The winning solutions and the associated workshop can be accessed at our website https://kddcup23.github.io/.
    zkFL: Zero-Knowledge Proof-based Gradient Aggregation for Federated Learning. (arXiv:2310.02554v3 [cs.AI] UPDATED)
    Federated Learning (FL) is a machine learning paradigm, which enables multiple and decentralized clients to collaboratively train a model under the orchestration of a central aggregator. Traditional FL solutions rely on the trust assumption of the centralized aggregator, which forms cohorts of clients in a fair and honest manner. However, a malicious aggregator, in reality, could abandon and replace the client's training models, or launch Sybil attacks to insert fake clients. Such malicious behaviors give the aggregator more power to control clients in the FL setting and determine the final training results. In this work, we introduce zkFL, which leverages zero-knowledge proofs (ZKPs) to tackle the issue of a malicious aggregator during the training model aggregation process. To guarantee the correct aggregation results, the aggregator needs to provide a proof per round. The proof can demonstrate to the clients that the aggregator executes the intended behavior faithfully. To further reduce the verification cost of clients, we employ a blockchain to handle the proof in a zero-knowledge way, where miners (i.e., the nodes validating and maintaining the blockchain data) can verify the proof without knowing the clients' local and aggregated models. The theoretical analysis and empirical results show that zkFL can achieve better security and privacy than traditional FL, without modifying the underlying FL network structure or heavily compromising the training speed.
    An effective theory of collective deep learning. (arXiv:2310.12802v1 [physics.soc-ph])
    Unraveling the emergence of collective learning in systems of coupled artificial neural networks is an endeavor with broader implications for physics, machine learning, neuroscience and society. Here we introduce a minimal model that condenses several recent decentralized algorithms by considering a competition between two terms: the local learning dynamics in the parameters of each neural network unit, and a diffusive coupling among units that tends to homogenize the parameters of the ensemble. We derive the coarse-grained behavior of our model via an effective theory for linear networks that we show is analogous to a deformed Ginzburg-Landau model with quenched disorder. This framework predicts (depth-dependent) disorder-order-disorder phase transitions in the parameters' solutions that reveal the onset of a collective learning phase, along with a depth-induced delay of the critical point and a robust shape of the microscopic learning path. We validate our theory in realistic ensembles of coupled nonlinear networks trained on the MNIST dataset under privacy constraints. Interestingly, experiments confirm that individual networks -- trained only with private data -- can fully generalize to unseen data classes when the collective learning phase emerges. Our work elucidates the physics of collective learning and contributes to the mechanistic interpretability of deep learning in decentralized settings.
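    The competition between local learning and diffusive coupling can be illustrated with a toy update rule. This is a sketch under strong assumptions, not the paper's model: each unit holds a single scalar parameter, the local loss is a hypothetical quadratic `(theta - target)**2`, and the coupling pulls every unit toward the ensemble mean (all-to-all diffusion). The names `coupled_step`, `lr`, and `coupling` are illustrative.

```python
def coupled_step(thetas, targets, lr=0.1, coupling=0.5):
    """One update: local gradient step plus a pull toward the ensemble mean."""
    mean_theta = sum(thetas) / len(thetas)
    updated = []
    for theta, target in zip(thetas, targets):
        local_grad = 2.0 * (theta - target)          # gradient of (theta - target)^2
        diffusion = coupling * (mean_theta - theta)  # homogenizing term
        updated.append(theta - lr * local_grad + diffusion)
    return updated


# Two units that already sit at their private optima: only diffusion acts.
thetas = coupled_step([0.0, 2.0], [0.0, 2.0])
```

    With zero local gradient, the step moves both parameters toward their common mean, which is the homogenizing effect the abstract attributes to the coupling term.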
    Provably Powerful Graph Neural Networks for Directed Multigraphs. (arXiv:2306.11586v2 [cs.LG] UPDATED)
    This paper analyses a set of simple adaptations that transform standard message-passing Graph Neural Networks (GNN) into provably powerful directed multigraph neural networks. The adaptations include multigraph port numbering, ego IDs, and reverse message passing. We prove that the combination of these theoretically enables the detection of any directed subgraph pattern. To validate the effectiveness of our proposed adaptations in practice, we conduct experiments on synthetic subgraph detection tasks, which demonstrate outstanding performance with almost perfect results. Moreover, we apply our proposed adaptations to two financial crime analysis tasks. We observe dramatic improvements in detecting money laundering transactions, improving the minority-class F1 score of a standard message-passing GNN by up to 30%, and closely matching or outperforming tree-based and GNN baselines. Similarly impressive results are observed on a real-world phishing detection dataset, boosting three standard GNNs' F1 scores by around 15% and outperforming all baselines.
    Neural Likelihood Approximation for Integer Valued Time Series Data. (arXiv:2310.12544v1 [stat.ML])
    Stochastic processes defined on integer valued state spaces are popular within the physical and biological sciences. These models are necessary for capturing the dynamics of small systems where the individual nature of the populations cannot be ignored and stochastic effects are important. The inference of the parameters of such models, from time series data, is difficult due to intractability of the likelihood; current methods, based on simulations of the underlying model, can be so computationally expensive as to be prohibitive. In this paper we construct a neural likelihood approximation for integer valued time series data using causal convolutions, which allows us to evaluate the likelihood of the whole time series in parallel. We demonstrate our method by performing inference on a number of ecological and epidemiological models, showing that we can accurately approximate the true posterior while achieving significant computational speed ups in situations where current methods struggle.
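    A causal convolution lets the output at time t depend only on inputs at times up to t, which is what allows a whole series to be processed in one pass. The sketch below is a generic pure-Python illustration, not the authors' network; the function name and the zero left-padding are assumptions.

```python
def causal_conv1d(x, kernel):
    """1D causal convolution: output[t] uses only x[t-k+1], ..., x[t]."""
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(x)  # left-pad so no future values are read
    return [
        sum(kernel[j] * padded[t + j] for j in range(k))
        for t in range(len(x))
    ]


series = [1.0, 2.0, 3.0, 4.0]
out = causal_conv1d(series, [0.5, 0.5])

# Changing the last observation leaves all earlier outputs untouched.
out_changed = causal_conv1d([1.0, 2.0, 3.0, 100.0], [0.5, 0.5])
```

    Every output position is computed independently of the others, so the positions can in principle be evaluated in parallel.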
    Inverse Renormalization Group of Disordered Systems. (arXiv:2310.12631v1 [cond-mat.stat-mech])
    We propose inverse renormalization group transformations to construct approximate configurations for lattice volumes that have not yet been accessed by supercomputers or large-scale simulations in the study of spin glasses. Specifically, starting from lattices of volume $V=8^{3}$ in the case of the three-dimensional Edwards-Anderson model we employ machine learning algorithms to construct rescaled lattices up to $V'=128^{3}$, which we utilize to extract two critical exponents. We conclude by discussing how to incorporate numerical exactness within inverse renormalization group approaches of disordered systems, thus opening up the opportunity to explore a sustainable and energy-efficient generation of exact configurations for increasing lattice volumes without the use of dedicated supercomputers.
    Compression of Recurrent Neural Networks using Matrix Factorization. (arXiv:2310.12688v1 [cs.LG])
    Compressing neural networks is a key step when deploying models for real-time or embedded applications. Factorizing the model's matrices using low-rank approximations is a promising method for achieving compression. While it is possible to set the rank before training, this approach is neither flexible nor optimal. In this work, we propose a post-training rank-selection method called Rank-Tuning that selects a different rank for each matrix. Used in combination with training adaptations, our method achieves high compression rates with no or little performance degradation. Our numerical experiments on signal processing tasks show that we can compress recurrent neural networks up to 14x with at most 1.4% relative performance reduction.
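    The saving from a low-rank factorization follows from simple parameter counting: an m x n matrix replaced by factors of shape m x r and r x n stores r(m + n) values instead of mn. The helper below is an illustrative calculation only; its name and the example sizes are assumptions and are not taken from the paper.

```python
def low_rank_compression_ratio(m: int, n: int, r: int) -> float:
    """Ratio of parameters in an m x n matrix to its rank-r factorization."""
    if not 0 < r <= min(m, n):
        raise ValueError("rank must satisfy 0 < r <= min(m, n)")
    return (m * n) / (r * (m + n))


# Hypothetical 512 x 512 weight matrix factorized at rank 32.
ratio = low_rank_compression_ratio(512, 512, 32)
```

    Choosing a different rank per matrix, as the abstract describes, amounts to picking a different `r` for each weight matrix rather than one global value.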
    Networkwide Traffic State Forecasting Using Exogenous Information: A Multi-Dimensional Graph Attention-Based Approach. (arXiv:2310.12353v1 [cs.LG])
    Traffic state forecasting is crucial for traffic management and control strategies, as well as user- and system-level decision making in the transportation network. While traffic forecasting has been approached with a variety of techniques over the last couple of decades, most approaches simply rely on endogenous traffic variables for state prediction, despite the evidence that exogenous factors can significantly impact traffic conditions. This paper proposes a multi-dimensional spatio-temporal graph attention-based traffic prediction approach (M-STGAT), which predicts traffic based on past observations of speed, along with lane closure events, temperature, and visibility across the transportation network. The approach is based on a graph attention network architecture, which also learns based on the structure of the transportation network on which these variables are observed. Numerical experiments are performed using traffic speed and lane closure data from the California Department of Transportation (Caltrans) Performance Measurement System (PeMS). The corresponding weather data were downloaded from the National Oceanic and Atmospheric Administration (NOAA) Automated Surface Observing Systems (ASOS). For comparison, the numerical experiments implement three alternative models which do not allow for the multi-dimensional input. The M-STGAT is shown to outperform the three alternative models, when performing tests using our primary data set for prediction with a 30-, 45-, and 60-minute prediction horizon, in terms of three error measures: Mean Absolute Error (MAE), Root Mean Square Error (RMSE) and Mean Absolute Percentage Error (MAPE). However, the model's transferability can vary for different transfer data sets and this aspect may require further investigation.
    Time-Aware Representation Learning for Time-Sensitive Question Answering. (arXiv:2310.12585v1 [cs.CL])
    Time is one of the crucial factors in real-world question answering (QA) problems. However, language models have difficulty understanding the relationships between time specifiers, such as 'after' and 'before', and numbers, since existing QA datasets do not include sufficient time expressions. To address this issue, we propose a Time-Context aware Question Answering (TCQA) framework. We suggest a Time-Context dependent Span Extraction (TCSE) task, and build a time-context dependent data generation framework for model training. Moreover, we present a metric to evaluate the time awareness of the QA model using TCSE. The TCSE task consists of a question and four sentence candidates classified as correct or incorrect based on time and context. The model is trained to extract the answer span from the sentence that is correct in both time and context. The model trained with TCQA outperforms baseline models by up to 8.5 F1 points on the TimeQA dataset. Our dataset and code are available at https://github.com/sonjbin/TCQA
    Voyager: An Open-Ended Embodied Agent with Large Language Models. (arXiv:2305.16291v2 [cs.AI] UPDATED)
    We introduce Voyager, the first LLM-powered embodied lifelong learning agent in Minecraft that continuously explores the world, acquires diverse skills, and makes novel discoveries without human intervention. Voyager consists of three key components: 1) an automatic curriculum that maximizes exploration, 2) an ever-growing skill library of executable code for storing and retrieving complex behaviors, and 3) a new iterative prompting mechanism that incorporates environment feedback, execution errors, and self-verification for program improvement. Voyager interacts with GPT-4 via blackbox queries, which bypasses the need for model parameter fine-tuning. The skills developed by Voyager are temporally extended, interpretable, and compositional, which compounds the agent's abilities rapidly and alleviates catastrophic forgetting. Empirically, Voyager shows strong in-context lifelong learning capability and exhibits exceptional proficiency in playing Minecraft. It obtains 3.3x more unique items, travels 2.3x longer distances, and unlocks key tech tree milestones up to 15.3x faster than prior SOTA. Voyager is able to utilize the learned skill library in a new Minecraft world to solve novel tasks from scratch, while other techniques struggle to generalize. We open-source our full codebase and prompts at https://voyager.minedojo.org/.
    An ML-assisted OTFS vs. OFDM adaptable modem. (arXiv:2309.01319v2 [eess.SP] UPDATED)
    The Orthogonal-Time-Frequency-Space (OTFS) signaling is known to be resilient to doubly-dispersive channels, which impacts high mobility scenarios. On the other hand, the Orthogonal-Frequency-Division-Multiplexing (OFDM) waveforms enjoy the benefits of the reuse of legacy architectures, simplicity of receiver design, and low-complexity detection. Several studies that compare the performance of OFDM and OTFS have indicated mixed outcomes due to the plethora of system parameters at play beyond high-mobility conditions. In this work, we exemplify this observation using simulations and propose a deep neural network (DNN)-based adaptation scheme to switch between using either an OTFS or OFDM signal processing chain at the transmitter and receiver for optimal mean-squared-error (MSE) performance. The DNN classifier is trained to switch between the two schemes by observing the channel condition, received SNR, and modulation format. We compare the performance of the OTFS, OFDM, and the proposed switched-waveform scheme. The simulations indicate superior performance with the proposed scheme with a well-trained DNN, thus improving the MSE performance of the communication significantly.
    Detection and Evaluation of bias-inducing Features in Machine learning. (arXiv:2310.12805v1 [cs.LG])
    The cause-to-effect analysis can help us decompose all the likely causes of a problem, such as an undesirable business situation or unintended harm to the individual(s). This implies that we can identify how the problems are inherited, rank the causes to help prioritize fixes, simplify a complex problem and visualize them. In the context of machine learning (ML), one can use cause-to-effect analysis to understand the reason for the biased behavior of the system. For example, we can examine the root causes of biases by checking each feature for a potential cause of bias in the model. To approach this, one can apply small changes to a given feature or a pair of features in the data, following some guidelines and observing how it impacts the decision made by the model (i.e., model prediction). Therefore, we can use cause-to-effect analysis to identify the potential bias-inducing features, even when these features are originally unknown. This is important since most current methods require a pre-identification of sensitive features for bias assessment and can actually miss other relevant bias-inducing features, which is why systematic identification of such features is necessary. Moreover, it often occurs that to achieve an equitable outcome, one has to take into account sensitive features in the model decision. Therefore, it should be up to the domain experts to decide based on their knowledge of the context of a decision whether bias induced by specific features is acceptable or not. In this study, we propose an approach for systematically identifying all bias-inducing features of a model to help support the decision-making of domain experts. We evaluated our technique using four well-known datasets to showcase how our contribution can help spearhead the standard procedure when developing, testing, maintaining, and deploying fair/equitable machine learning systems.
    Gradient Descent Fails to Learn High-frequency Functions and Modular Arithmetic. (arXiv:2310.12660v1 [cs.LG])
    Classes of target functions containing a large number of approximately orthogonal elements are known to be hard to learn by the Statistical Query algorithms. Recently this classical fact re-emerged in a theory of gradient-based optimization of neural networks. In the novel framework, the hardness of a class is usually quantified by the variance of the gradient with respect to a random choice of a target function. A set of functions of the form $x\to ax \bmod p$, where $a$ is taken from ${\mathbb Z}_p$, has attracted some attention from deep learning theorists and cryptographers recently. This class can be understood as a subset of $p$-periodic functions on ${\mathbb Z}$ and is tightly connected with a class of high-frequency periodic functions on the real line. We present a mathematical analysis of limitations and challenges associated with using gradient-based learning techniques to train a high-frequency periodic function or modular multiplication from examples. We highlight that the variance of the gradient is negligibly small in both cases when either a frequency or the prime base $p$ is large. This in turn prevents such a learning algorithm from being successful.
    Red Teaming Language Model Detectors with Language Models. (arXiv:2305.19713v2 [cs.CL] UPDATED)
    The prevalence and strong capability of large language models (LLMs) present significant safety and ethical risks if exploited by malicious users. To prevent the potentially deceptive usage of LLMs, recent works have proposed algorithms to detect LLM-generated text and protect LLMs. In this paper, we investigate the robustness and reliability of these LLM detectors under adversarial attacks. We study two types of attack strategies: 1) replacing certain words in an LLM's output with their synonyms given the context; 2) automatically searching for an instructional prompt to alter the writing style of the generation. In both strategies, we leverage an auxiliary LLM to generate the word replacements or the instructional prompt. Different from previous works, we consider a challenging setting where the auxiliary LLM can also be protected by a detector. Experiments reveal that our attacks effectively compromise the performance of all detectors in the study with plausible generations, underscoring the urgent need to improve the robustness of LLM-generated text detection systems.
    Test-Time Distribution Normalization for Contrastively Learned Vision-language Models. (arXiv:2302.11084v2 [cs.LG] UPDATED)
    Advances in the field of vision-language contrastive learning have made it possible for many downstream applications to be carried out efficiently and accurately by simply taking the dot product between image and text representations. One of the most representative approaches proposed recently known as CLIP has garnered widespread adoption due to its effectiveness. CLIP is trained with an InfoNCE loss that takes into account both positive and negative samples to help learn a much more robust representation space. This paper reveals that the common downstream practice of taking a dot product is only a zeroth-order approximation of the optimization goal, resulting in a loss of information during test-time. Intuitively, since the model has been optimized based on the InfoNCE loss, test-time procedures should also be in alignment. The question lies in how one can retrieve any semblance of negative samples information during inference in a computationally efficient way. To this end, we propose Distribution Normalization (DN), where we approximate the mean representation of a batch of test samples and use such a mean to represent what would be analogous to negative samples in the InfoNCE loss. DN requires no retraining or fine-tuning and can be effortlessly applied during inference. Extensive experiments on a wide variety of downstream tasks exhibit a clear advantage of DN over the dot product on top of other existing test-time augmentation methods.
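    The mean-subtraction idea behind Distribution Normalization can be sketched in a few lines. This is an illustration of the general idea described in the abstract, not the authors' exact formulation: the function names and the `scale` factor (how much of the batch mean is subtracted) are assumptions, and the two-dimensional embeddings are toy values.

```python
def mean_vector(vectors):
    """Component-wise mean of a batch of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def dn_score(image, text, image_mean, text_mean, scale=0.5):
    """Dot product after shifting each embedding by a scaled batch mean.

    `scale` is a hypothetical knob, not a value taken from the paper.
    """
    img = [x - scale * m for x, m in zip(image, image_mean)]
    txt = [y - scale * m for y, m in zip(text, text_mean)]
    return sum(a * b for a, b in zip(img, txt))


# Toy test batch: two images and two captions, already embedded.
images = [[1.0, 0.0], [0.0, 1.0]]
texts = [[1.0, 0.0], [0.0, 1.0]]
image_mean = mean_vector(images)
text_mean = mean_vector(texts)

match = dn_score(images[0], texts[0], image_mean, text_mean)
mismatch = dn_score(images[0], texts[1], image_mean, text_mean)
```

    The batch means are estimated once from test samples and reused for every pair, so the extra cost at inference is a vector subtraction per embedding and no retraining is involved.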
    Post-processing Private Synthetic Data for Improving Utility on Selected Measures. (arXiv:2305.15538v2 [cs.LG] UPDATED)
    Existing private synthetic data generation algorithms are agnostic to downstream tasks. However, end users may have specific requirements that the synthetic data must satisfy. Failure to meet these requirements could significantly reduce the utility of the data for downstream use. We introduce a post-processing technique that improves the utility of the synthetic data with respect to measures selected by the end user, while preserving strong privacy guarantees and dataset quality. Our technique involves resampling from the synthetic data to filter out samples that do not meet the selected utility measures, using an efficient stochastic first-order algorithm to find optimal resampling weights. Through comprehensive numerical experiments, we demonstrate that our approach consistently improves the utility of synthetic data across multiple benchmark datasets and state-of-the-art synthetic data generation algorithms.
    Physics-informed neural networks in the recreation of hydrodynamic simulations from dark matter. (arXiv:2303.14090v2 [astro-ph.CO] UPDATED)
    Physics-informed neural networks have emerged as a coherent framework for building predictive models that combine statistical patterns with domain knowledge. The underlying notion is to enrich the optimization loss function with known relationships to constrain the space of possible solutions. Hydrodynamic simulations are a core constituent of modern cosmology, while the required computations are both expensive and time-consuming. At the same time, the comparatively fast simulation of dark matter requires fewer resources, which has led to the emergence of machine learning algorithms for baryon inpainting as an active area of research; here, recreating the scatter found in hydrodynamic simulations is an ongoing challenge. This paper presents the first application of physics-informed neural networks to baryon inpainting by combining advances in neural network architectures with physical constraints, injecting theory on baryon conversion efficiency into the model loss function. We also introduce a punitive prediction comparison based on the Kullback-Leibler divergence, which enforces scatter reproduction. By simultaneously extracting the complete set of baryonic properties for the Simba suite of cosmological simulations, our results demonstrate improved accuracy of baryonic predictions based on dark matter halo properties, successful recovery of the fundamental metallicity relation, and retrieval of scatter that traces the target simulation's distribution.
    Towards a Deep Learning-based Online Quality Prediction System for Welding Processes. (arXiv:2310.12632v1 [cs.LG])
    The digitization of manufacturing processes enables promising applications for machine learning-assisted quality assurance. A widely used manufacturing process that can strongly benefit from data-driven solutions is gas metal arc welding (GMAW). The welding process is characterized by complex cause-effect relationships between material properties, process conditions and weld quality. In non-laboratory environments with frequently changing process parameters, accurate determination of weld quality by destructive testing is economically unfeasible. Deep learning offers the potential to identify the relationships in available process data and predict the weld quality from process observations. In this paper, we present a concept for a deep learning-based predictive quality system in GMAW. At its core, the concept involves a pipeline consisting of four major phases: collection and management of multi-sensor data (e.g. current and voltage), real-time processing and feature engineering of the time series data by means of autoencoders, training and deployment of suitable recurrent deep learning models for quality predictions, and model evolutions under changing process conditions using continual learning. The concept provides the foundation for future research activities in which we will realize an online predictive quality system for running production.
    Safety-Gymnasium: A Unified Safe Reinforcement Learning Benchmark. (arXiv:2310.12567v1 [cs.AI])
    Artificial intelligence (AI) systems possess significant potential to drive societal progress. However, their deployment often faces obstacles due to substantial safety concerns. Safe reinforcement learning (SafeRL) emerges as a solution to optimize policies while simultaneously adhering to multiple constraints, thereby addressing the challenge of integrating reinforcement learning in safety-critical scenarios. In this paper, we present an environment suite called Safety-Gymnasium, which encompasses safety-critical tasks in both single and multi-agent scenarios, accepting vector and vision-only input. Additionally, we offer a library of algorithms named Safe Policy Optimization (SafePO), comprising 16 state-of-the-art SafeRL algorithms. This comprehensive library can serve as a validation tool for the research community. By introducing this benchmark, we aim to facilitate the evaluation and comparison of safety performance, thus fostering the development of reinforcement learning for safer, more reliable, and responsible real-world applications. The website of this project can be accessed at https://sites.google.com/view/safety-gymnasium.
    Shall We Pretrain Autoregressive Language Models with Retrieval? A Comprehensive Study. (arXiv:2304.06762v2 [cs.CL] UPDATED)
    Large decoder-only language models (LMs) can be largely improved in terms of perplexity by retrieval (e.g., RETRO), but its impact on text generation quality and downstream task accuracy is unclear. Thus, it is still an open question: shall we pretrain large autoregressive LMs with retrieval? To answer it, we perform a comprehensive study on a scalable pre-trained retrieval-augmented LM (i.e., RETRO) compared with standard GPT and retrieval-augmented GPT incorporated at fine-tuning or inference stages. We first provide the recipe to reproduce RETRO up to 9.5B parameters while retrieving a text corpus with 330B tokens. Based on that, we have the following novel findings: i) RETRO outperforms GPT on text generation with much less degeneration (i.e., repetition), moderately higher factual accuracy, and slightly lower toxicity with a nontoxic retrieval database. ii) On the LM Evaluation Harness benchmark, RETRO largely outperforms GPT on knowledge-intensive tasks, but is on par with GPT on other tasks. Furthermore, we introduce a simple variant of the model, RETRO++, which largely improves open-domain QA results of original RETRO (e.g., EM score +8.6 on Natural Question) and significantly outperforms retrieval-augmented GPT in both fine-tuning and zero-shot evaluation settings. Our findings highlight the promising direction of pretraining autoregressive LMs with retrieval as future foundation models. We release our implementation at: https://github.com/NVIDIA/Megatron-LM#retro.
    REVAMP: Automated Simulations of Adversarial Attacks on Arbitrary Objects in Realistic Scenes. (arXiv:2310.12243v1 [cs.LG])
    Deep learning models, such as those used in autonomous vehicles, are vulnerable to adversarial attacks in which an attacker places an adversarial object in the environment, leading to misclassification. Generating these adversarial objects in the digital space has been extensively studied; however, successfully transferring these attacks from the digital realm to the physical realm has proven challenging when controlling for real-world environmental factors. In response to these limitations, we introduce REVAMP, an easy-to-use Python library that is the first-of-its-kind tool for creating attack scenarios with arbitrary objects and simulating realistic environmental factors, lighting, reflection, and refraction. REVAMP enables researchers and practitioners to swiftly explore various scenarios within the digital realm by offering a wide range of configurable options for designing experiments and using differentiable rendering to reproduce physically plausible adversarial objects. We will demonstrate and invite the audience to try REVAMP to produce an adversarial texture on a chosen object while having control over various scene parameters. The audience will choose a scene, an object to attack, the desired attack class, and the number of camera positions to use. Then, in real time, we show how this altered texture causes the chosen object to be misclassified, showcasing the potential of REVAMP in real-world scenarios. REVAMP is open-source and available at https://github.com/poloclub/revamp.
    Patch Diffusion: Faster and More Data-Efficient Training of Diffusion Models. (arXiv:2304.12526v2 [cs.CV] UPDATED)
    Diffusion models are powerful, but they require a lot of time and data to train. We propose Patch Diffusion, a generic patch-wise training framework, to significantly reduce the training time costs while improving data efficiency, which thus helps democratize diffusion model training to broader users. At the core of our innovations is a new conditional score function at the patch level, where the patch location in the original image is included as additional coordinate channels, while the patch size is randomized and diversified throughout training to encode the cross-region dependency at multiple scales. Sampling with our method is as easy as in the original diffusion model. Through Patch Diffusion, we could achieve $\ge 2\times$ faster training, while maintaining comparable or better generation quality. Patch Diffusion meanwhile improves the performance of diffusion models trained on relatively small datasets, e.g., as few as 5,000 images to train from scratch. We achieve outstanding FID scores in line with state-of-the-art benchmarks: 1.77 on CelebA-64$\times$64, 1.93 on AFHQv2-Wild-64$\times$64, and 2.72 on ImageNet-256$\times$256. We share our code and pre-trained models at https://github.com/Zhendong-Wang/Patch-Diffusion.
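    The patch-level conditioning described above can be illustrated with a minimal sketch. It only shows how a randomly sized crop might be paired with two coordinate channels that record where each pixel sat in the original image. The function name `patch_with_coords`, the normalization of coordinates to [-1, 1], and the candidate patch sizes are assumptions made for illustration and are not taken from the released code.

```python
import random

def patch_with_coords(image, patch_size, rng):
    """Crop a random square patch and build two coordinate channels.

    image: H x W nested list of pixel values (single channel for brevity).
    Returns (patch, xs, ys); xs and ys hold each pixel's column and row
    position in the ORIGINAL image, scaled to [-1, 1] (an assumed convention).
    """
    h, w = len(image), len(image[0])
    top = rng.randrange(h - patch_size + 1)
    left = rng.randrange(w - patch_size + 1)
    rows = range(top, top + patch_size)
    cols = range(left, left + patch_size)
    patch = [[image[r][c] for c in cols] for r in rows]
    xs = [[2.0 * c / (w - 1) - 1.0 for c in cols] for _ in rows]
    ys = [[2.0 * r / (h - 1) - 1.0 for _ in cols] for r in rows]
    return patch, xs, ys

rng = random.Random(0)
image = [[r * 8 + c for c in range(8)] for r in range(8)]
# Randomize the patch size per sample, as the abstract describes.
size = rng.choice([2, 4, 8])
patch, xs, ys = patch_with_coords(image, size, rng)
```

    In a real training loop the three arrays would be stacked along the channel axis before being passed to the denoiser; that step is omitted here.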
    On the Optimization and Generalization of Multi-head Attention. (arXiv:2310.12680v1 [cs.LG])
    The training and generalization dynamics of the Transformer's core mechanism, namely the Attention mechanism, remain under-explored. Besides, existing analyses primarily focus on single-head attention. Inspired by the demonstrated benefits of overparameterization when training fully-connected networks, we investigate the potential optimization and generalization advantages of using multiple attention heads. Towards this goal, we derive convergence and generalization guarantees for gradient-descent training of a single-layer multi-head self-attention model, under a suitable realizability condition on the data. We then establish primitive conditions on the initialization that ensure realizability holds. Finally, we demonstrate that these conditions are satisfied for a simple tokenized-mixture model. We expect the analysis can be extended to various data-model and architecture variations.
    A PAC Learning Algorithm for LTL and Omega-regular Objectives in MDPs. (arXiv:2310.12248v1 [cs.LG])
    Linear temporal logic (LTL) and omega-regular objectives -- a superset of LTL -- have seen recent use as a way to express non-Markovian objectives in reinforcement learning. We introduce a model-based probably approximately correct (PAC) learning algorithm for omega-regular objectives in Markov decision processes. Unlike prior approaches, our algorithm learns from sampled trajectories of the system and does not require prior knowledge of the system's topology.
    Personalized human mobility prediction for HuMob challenge. (arXiv:2310.12900v1 [cs.LG])
    We explain the methodology used to create the data submitted to the HuMob Challenge, a data analysis competition for human mobility prediction. We adopted a personalized model that predicts an individual's movement trajectory from that individual's own data, instead of predicting from the overall movement, based on the hypothesis that human movement is unique to each person. We devised features such as the date and time, activity time, day of the week, time of day, and frequency of visits to POIs (Points of Interest). As additional features, we incorporated the movement of other individuals with similar behavior patterns, identified through clustering. The machine learning model we adopted was Support Vector Regression (SVR). We evaluated accuracy through offline assessment and carried out feature selection and parameter tuning. Although the provided dataset consists of the trajectories of 100,000 users, our method uses only the data of the 20,000 target users and does not need the data of the other 80,000 users. Despite the personalized model's traditional feature engineering approach, it yields reasonably good accuracy at a lower computational cost.
    Denoising Heat-inspired Diffusion with Insulators for Collision Free Motion Planning. (arXiv:2310.12609v1 [cs.RO])
    Diffusion models have risen as a powerful tool in robotics due to their flexibility and multi-modality. While some of these methods effectively address complex problems, they often depend heavily on inference-time obstacle detection and require additional equipment. Addressing these challenges, we present a method that, during inference time, simultaneously generates only reachable goals and plans motions that avoid obstacles, all from a single visual input. Central to our approach is the novel use of a collision-avoiding diffusion kernel for training. Through evaluations against behavior-cloning and classical diffusion models, our framework has proven its robustness. It is particularly effective in multi-modal environments, navigating toward goals and avoiding unreachable ones blocked by obstacles, while ensuring collision avoidance.
    Towards Better Dynamic Graph Learning: New Architecture and Unified Library. (arXiv:2303.13047v3 [cs.LG] UPDATED)
    We propose DyGFormer, a new Transformer-based architecture for dynamic graph learning. DyGFormer is conceptually simple and only needs to learn from nodes' historical first-hop interactions by: (1) a neighbor co-occurrence encoding scheme that explores the correlations of the source node and destination node based on their historical sequences; (2) a patching technique that divides each sequence into multiple patches and feeds them to Transformer, allowing the model to effectively and efficiently benefit from longer histories. We also introduce DyGLib, a unified library with standard training pipelines, extensible coding interfaces, and comprehensive evaluating protocols to promote reproducible, scalable, and credible dynamic graph learning research. By performing exhaustive experiments on thirteen datasets for dynamic link prediction and dynamic node classification tasks, we find that DyGFormer achieves state-of-the-art performance on most of the datasets, demonstrating its effectiveness in capturing nodes' correlations and long-term temporal dependencies. Moreover, some results of baselines are inconsistent with previous reports, which may be caused by their diverse but less rigorous implementations, showing the importance of DyGLib. All the used resources are publicly available at https://github.com/yule-BUAA/DyGLib.
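    The two ingredients named in the abstract, neighbor co-occurrence encoding and patching, can be sketched on plain Python lists. This is only an illustrative reading of the abstract: the helper names `cooccurrence_counts` and `to_patches`, the use of raw counts as the encoding input, and right-padding with zeros are assumptions, not the DyGLib implementation.

```python
from collections import Counter

def cooccurrence_counts(src_seq, dst_seq):
    """For every neighbor in each historical sequence, return the pair
    [occurrences in src_seq, occurrences in dst_seq]."""
    src_c, dst_c = Counter(src_seq), Counter(dst_seq)

    def encode(seq):
        return [[src_c[v], dst_c[v]] for v in seq]

    return encode(src_seq), encode(dst_seq)

def to_patches(seq, patch_size, pad=0):
    """Split a sequence into consecutive patches of patch_size items,
    right-padding the last patch (assumed padding scheme)."""
    padded = list(seq) + [pad] * (-len(seq) % patch_size)
    return [padded[i:i + patch_size] for i in range(0, len(padded), patch_size)]

# Hypothetical first-hop neighbor histories of a source and a destination node.
src_hist = [2, 3, 2, 5]
dst_hist = [3, 3, 7]
src_enc, dst_enc = cooccurrence_counts(src_hist, dst_hist)
patches = to_patches([1, 2, 3, 4, 5], patch_size=2)
```

    Patching shortens the sequence the Transformer sees by a factor of `patch_size`, which is what lets longer histories fit at similar cost.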
    Stochastic Average Gradient : A Simple Empirical Investigation. (arXiv:2310.12771v1 [cs.LG])
    Despite the recent growth of theoretical studies and empirical successes of neural networks, gradient backpropagation is still the most widely used algorithm for training such networks. On the one hand, we have deterministic or full gradient (FG) approaches that have a cost proportional to the amount of training data used but have a linear convergence rate, and on the other hand, stochastic gradient (SG) methods that have a cost independent of the size of the dataset, but have a slower convergence rate than the deterministic approaches. To combine the cost of the stochastic approach with the convergence rate of the deterministic approach, the stochastic average gradient (SAG) method has been proposed. SAG is a method for optimizing the sum of a finite number of smooth convex functions. Like SG methods, the SAG method's iteration cost is independent of the number of terms in the sum. In this work, we propose to compare SAG to some standard optimizers used in machine learning. SAG converges faster than other optimizers on simple toy problems and performs better than many other optimizers on simple machine learning problems. We also propose a combination of SAG with the momentum algorithm and Adam. These combinations empirically allow higher speed and obtain better performance than the other methods, especially when the landscape of the function to optimize presents obstacles or is ill-conditioned.
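    As a rough illustration of the SAG update described above, the sketch below minimizes a one-dimensional least-squares sum. It keeps a table with the most recent gradient of every term, refreshes one randomly chosen entry per iteration, and steps along the running average. The function name, the toy data, and the step size 1/(16L) are choices made for this example, not details from the paper's experiments.

```python
import random

def sag_least_squares(a, b, steps=20000, seed=0):
    """Minimize (1/n) * sum_i 0.5 * (a[i] * x - b[i])**2 over scalar x with SAG."""
    rng = random.Random(seed)
    n = len(a)
    lipschitz = max(ai * ai for ai in a)   # smoothness constant of the terms
    alpha = 1.0 / (16.0 * lipschitz)       # conservative step size (assumed)
    x = 0.0
    grad_table = [0.0] * n                 # last seen gradient of each term
    grad_sum = 0.0                         # running sum of the table
    for _ in range(steps):
        i = rng.randrange(n)               # one term per iteration, like SG
        g = a[i] * (a[i] * x - b[i])       # fresh gradient of term i
        grad_sum += g - grad_table[i]      # swap the old entry for the new one
        grad_table[i] = g
        x -= alpha * grad_sum / n          # step along the averaged gradient
    return x

a = [1.0, 2.0, 0.5, 1.5]
b = [2.0, 3.9, 1.1, 3.0]
x_sag = sag_least_squares(a, b)
# Closed-form minimizer for comparison.
x_star = sum(ai * bi for ai, bi in zip(a, b)) / sum(ai * ai for ai in a)
```

    Each iteration touches a single term, so the per-step cost does not grow with the number of terms, while the stored gradients supply the averaging that a full-gradient step would provide.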
    Approximate information maximization for bandit games. (arXiv:2310.12563v1 [stat.ML])
    Entropy maximization and free energy minimization are general physical principles for modeling the dynamics of various physical systems. Notable examples include modeling decision-making within the brain using the free-energy principle, optimizing the accuracy-complexity trade-off when accessing hidden variables with the information bottleneck principle (Tishby et al., 2000), and navigation in random environments using information maximization (Vergassola et al., 2007). Building on this principle, we propose a new class of bandit algorithms that maximize an approximation to the information of a key variable within the system. To this end, we develop an approximate analytical physics-based representation of an entropy to forecast the information gain of each action and greedily choose the one with the largest information gain. This method yields strong performance in classical bandit settings. Motivated by its empirical success, we prove its asymptotic optimality for the two-armed bandit problem with Gaussian rewards. Owing to its ability to encompass the system's properties in a global physical functional, this approach can be efficiently adapted to more complex bandit settings, calling for further investigation of information maximization approaches for multi-armed bandit problems.
    Testing the Consistency of Performance Scores Reported for Binary Classification Problems. (arXiv:2310.12527v1 [cs.LG])
    Binary classification is a fundamental task in machine learning, with applications spanning various scientific domains. Whether scientists are conducting fundamental research or refining practical applications, they typically assess and rank classification techniques based on performance metrics such as accuracy, sensitivity, and specificity. However, reported performance scores may not always serve as a reliable basis for research ranking. This can be attributed to undisclosed or unconventional practices related to cross-validation, typographical errors, and other factors. In a given experimental setup, with a specific number of positive and negative test items, most performance scores can assume specific, interrelated values. In this paper, we introduce numerical techniques to assess the consistency of reported performance scores and the assumed experimental setup. Importantly, the proposed approach does not rely on statistical inference but uses numerical methods to identify inconsistencies with certainty. Through three different applications related to medicine, we demonstrate how the proposed techniques can effectively detect inconsistencies, thereby safeguarding the integrity of research fields. To benefit the scientific community, we have made the consistency tests available in an open-source Python package.
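    The idea that reported scores can only take specific, interrelated values can be made concrete with a small brute-force check. The sketch below enumerates every confusion matrix compatible with a given number of positive and negative test items and keeps those that reproduce the reported accuracy, sensitivity, and specificity up to rounding. The function name `consistent_confusions` and the half-unit rounding tolerance are assumptions for illustration; this is not the API of the authors' package.

```python
def consistent_confusions(p, n, acc, sens, spec, decimals=4):
    """Return all (tp, tn) pairs that reproduce the reported scores.

    p, n: number of positive and negative test items.
    Scores are assumed to be rounded to `decimals` places, so a candidate
    matches when it lies within half a unit of the last reported digit.
    """
    eps = 0.5 * 10 ** (-decimals) + 1e-12  # rounding tolerance plus float slack
    matches = []
    for tp in range(p + 1):
        if abs(tp / p - sens) > eps:
            continue
        for tn in range(n + 1):
            if abs(tn / n - spec) > eps:
                continue
            if abs((tp + tn) / (p + n) - acc) <= eps:
                matches.append((tp, tn))
    return matches

# Scores that some confusion matrix can produce on 100 positives, 200 negatives.
ok = consistent_confusions(100, 200, acc=0.8667, sens=0.9, spec=0.85)
# Same sensitivity and specificity, but an accuracy no confusion matrix yields.
bad = consistent_confusions(100, 200, acc=0.9, sens=0.9, spec=0.85)
```

    An empty result means that no integer counts of true positives and true negatives can generate the reported triple under the assumed setup, which is a certain inconsistency rather than a statistical one.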
    A Scalable Test Problem Generator for Sequential Transfer Optimization. (arXiv:2304.08503v4 [cs.NE] UPDATED)
    Sequential transfer optimization (STO), which aims to improve the optimization performance on a task of interest by exploiting the knowledge captured from several previously-solved optimization tasks stored in a database, has been gaining increasing research attention over the years. However, despite the remarkable advances in algorithm design, the development of a systematic benchmark suite for comprehensive comparisons of STO algorithms has received far less attention. Existing test problems are either simply generated by assembling other benchmark functions or extended from specific practical problems with limited scalability. The relationships between the optimal solutions of the source and target tasks in these problems are also often manually configured, limiting their ability to model the different similarity relationships present in real-world problems. Consequently, the good performance achieved by an algorithm on these problems might be biased and hard to generalize to other problems. In light of the above, in this study, we first introduce four concepts for characterizing STO problems and present an important problem feature, namely similarity distribution, which quantitatively delineates the relationship between the optima of the source and target tasks. Then, we present the general design guidelines of STO problems and a particular STO problem generator with good scalability. Specifically, the similarity distribution of a problem can be easily customized, enabling a continuous spectrum of representation of the diverse similarity relationships of real-world problems. Lastly, a benchmark suite with 12 STO problems featuring a variety of customized similarity relationships is developed using the proposed generator. The source code of the problem generator is available at https://github.com/XmingHsueh/STOP-G.
    SemantIC: Semantic Interference Cancellation Towards 6G Wireless Communications. (arXiv:2310.12768v1 [eess.SP])
    This letter proposes a novel anti-interference technique, semantic interference cancellation (SemantIC), for enhancing information quality towards the sixth-generation (6G) wireless networks. SemantIC only requires the receiver to concatenate the channel decoder with a semantic auto-encoder. This constructs a turbo loop which iteratively and alternately eliminates noise in the signal domain and the semantic domain. From the viewpoint of network information theory, the neural network of the semantic auto-encoder stores side information by training, and provides side information in iterative decoding, as an implementation of the Wyner-Ziv theorem. Simulation results verify the performance improvement by SemantIC without extra channel resource cost.
    Automatic Hallucination Assessment for Aligned Large Language Models via Transferable Adversarial Attacks. (arXiv:2310.12516v1 [cs.CL])
    Although remarkable progress has been achieved in preventing large language model (LLM) hallucinations using instruction tuning and retrieval augmentation, it remains challenging to measure the reliability of LLMs using human-crafted evaluation data, which is not available for many tasks and domains and could suffer from data leakage. Inspired by adversarial machine learning, this paper aims to develop a method of automatically generating evaluation data by appropriately modifying existing data on which LLMs behave faithfully. Specifically, this paper presents AutoDebug, an LLM-based framework that uses prompt chaining to generate transferable adversarial attacks in the form of question-answering examples. We seek to understand the extent to which these examples trigger the hallucination behaviors of LLMs. We implement AutoDebug using ChatGPT and evaluate the resulting two variants of a popular open-domain question-answering dataset, Natural Questions (NQ), on a collection of open-source and proprietary LLMs under various prompting settings. Our generated evaluation data is human-readable and, as we show, humans can answer these modified questions well. Nevertheless, we observe pronounced accuracy drops across multiple LLMs including GPT-4. Our experimental results show that LLMs are likely to hallucinate in two categories of question-answering scenarios where (1) there are conflicts between knowledge given in the prompt and their parametric knowledge, or (2) the knowledge expressed in the prompt is complex. Finally, we find that the adversarial examples generated by our method are transferable across all considered LLMs. The examples generated by a small model can be used to debug a much larger model, making our approach cost-effective.
    Open-World Lifelong Graph Learning. (arXiv:2310.12565v1 [cs.LG])
    We study the problem of lifelong graph learning in an open-world scenario, where a model needs to deal with new tasks and potentially unknown classes. We utilize Out-of-Distribution (OOD) detection methods to recognize new classes and adapt existing non-graph OOD detection methods to graph data. Crucially, we suggest performing new class detection by combining OOD detection methods with information aggregated from the graph neighborhood. Most OOD detection methods avoid determining a crisp threshold for deciding whether a vertex is OOD. To tackle this problem, we propose a Weakly-supervised Relevance Feedback (Open-WRF) method, which decreases the sensitivity to thresholds in OOD detection. We evaluate our approach on six benchmark datasets. Our results show that the proposed neighborhood aggregation method for OOD scores outperforms existing methods independent of the underlying graph neural network. Furthermore, we demonstrate that our Open-WRF method is more robust to threshold selection and analyze the influence of graph neighborhood on OOD detection. The aggregation and threshold methods are compatible with arbitrary graph neural networks and OOD detection methods, making our approach versatile and applicable to many real-world applications.
    A Unifying Framework for Learning Argumentation Semantics. (arXiv:2310.12309v1 [cs.AI])
    Argumentation is a very active research field of Artificial Intelligence concerned with the representation and evaluation of arguments used in dialogues between humans and/or artificial agents. Acceptability semantics of formal argumentation systems define the criteria for the acceptance or rejection of arguments. Several software systems, known as argumentation solvers, have been developed to compute the accepted/rejected arguments using such criteria. These include systems that learn to identify the accepted arguments using non-interpretable methods. In this paper we present a novel framework, which uses an Inductive Logic Programming approach to learn the acceptability semantics for several abstract and structured argumentation frameworks in an interpretable way. Through an empirical evaluation we show that our framework outperforms existing argumentation solvers, thus opening up new future research directions in the area of formal argumentation and human-machine dialogues.
    Operator-Based Detecting, Learning, and Stabilizing Unstable Periodic Orbits of Chaotic Attractors. (arXiv:2310.12156v1 [nlin.AO])
    This paper examines the use of operator-theoretic approaches to the analysis of chaotic systems through the lens of their unstable periodic orbits (UPOs). Our approach involves three data-driven steps for detecting, identifying, and stabilizing UPOs. We demonstrate the use of kernel integral operators within delay coordinates as an innovative method for UPO detection. For identifying the dynamic behavior associated with each individual UPO, we utilize the Koopman operator to present the dynamics as linear equations in the space of Koopman eigenfunctions. This allows for characterizing the chaotic attractor by investigating its principal dynamical modes across varying UPOs. We extend this methodology into an interpretable machine learning framework aimed at stabilizing strange attractors on their UPOs. To illustrate the efficacy of our approach, we apply it to the Lorenz attractor as a case study.
    Category-Agnostic 6D Pose Estimation with Conditional Neural Processes. (arXiv:2206.07162v2 [cs.CV] UPDATED)
    We present a novel meta-learning approach for 6D pose estimation on unknown objects. In contrast to "instance-level" and "category-level" pose estimation methods, our algorithm learns object representation in a category-agnostic way, which endows it with strong generalization capabilities across object categories. Specifically, we employ a neural process-based meta-learning approach to train an encoder to capture texture and geometry of an object in a latent representation, based on very few RGB-D images and ground-truth keypoints. The latent representation is then used by a simultaneously meta-trained decoder to predict the 6D pose of the object in new images. Furthermore, we propose a novel geometry-aware decoder for the keypoint prediction using a Graph Neural Network (GNN), which explicitly takes geometric constraints specific to each object into consideration. To evaluate our algorithm, extensive experiments are conducted on the LineMOD dataset, and on our new fully-annotated synthetic datasets generated from Multiple Categories in Multiple Scenes (MCMS). Experimental results demonstrate that our model performs well on unseen objects with very different shapes and appearances. Remarkably, our model also shows robust performance on occluded scenes although trained fully on data without occlusion. To our knowledge, this is the first work exploring cross-category level 6D pose estimation.
    Julearn: an easy-to-use library for leakage-free evaluation and inspection of ML models. (arXiv:2310.12568v1 [cs.LG])
    The fast-paced development of machine learning (ML) methods, coupled with their increasing adoption in research, poses challenges for researchers without extensive training in ML. In neuroscience, for example, ML can help understand brain-behavior relationships, diagnose diseases, and develop biomarkers using various data sources like magnetic resonance imaging and electroencephalography. The primary objective of ML is to build models that can make accurate predictions on unseen data. Researchers aim to prove the existence of such generalizable models by evaluating performance using techniques such as cross-validation (CV), which uses systematic subsampling to estimate the generalization performance. Choosing a CV scheme and evaluating an ML pipeline can be challenging and, if done improperly, can lead to overestimated results and incorrect interpretations. We created julearn, an open-source Python library that allows researchers to design and evaluate complex ML pipelines without encountering common pitfalls. In this manuscript, we present the rationale behind julearn's design and its core features, and showcase three examples of previously-published research projects that can be easily implemented using this novel library. Julearn aims to simplify the entry into the ML world by providing an easy-to-use environment with built-in guards against some of the most common ML pitfalls. With its design, unique features and simple interface, it serves as a useful Python-based library for research projects.
    Differentiable Vertex Fitting for Jet Flavour Tagging. (arXiv:2310.12804v1 [hep-ex])
    We propose a differentiable vertex fitting algorithm that can be used for secondary vertex fitting, and that can be seamlessly integrated into neural networks for jet flavour tagging. Vertex fitting is formulated as an optimization problem where gradients of the optimized solution vertex are defined through implicit differentiation and can be passed to upstream or downstream neural network components for network training. More broadly, this is an application of differentiable programming to integrate physics knowledge into neural network models in high energy physics. We demonstrate how differentiable secondary vertex fitting can be integrated into larger transformer-based models for flavour tagging and improve heavy flavour jet classification.
    Learning threshold neurons via the "edge of stability". (arXiv:2212.07469v2 [cs.LG] UPDATED)
    Existing analyses of neural network training often operate under the unrealistic assumption of an extremely small learning rate. This lies in stark contrast to practical wisdom and empirical studies, such as the work of J. Cohen et al. (ICLR 2021), which exhibit startling new phenomena (the "edge of stability" or "unstable convergence") and potential benefits for generalization in the large learning rate regime. Despite a flurry of recent works on this topic, however, the latter effect is still poorly understood. In this paper, we take a step towards understanding genuinely non-convex training dynamics with large learning rates by performing a detailed analysis of gradient descent for simplified models of two-layer neural networks. For these models, we provably establish the edge of stability phenomenon and discover a sharp phase transition for the step size below which the neural network fails to learn "threshold-like" neurons (i.e., neurons with a non-zero first-layer bias). This elucidates one possible mechanism by which the edge of stability can in fact lead to better generalization, as threshold neurons are basic building blocks with useful inductive bias for many tasks.
    DA-TransUNet: Integrating Spatial and Channel Dual Attention with Transformer U-Net for Medical Image Segmentation. (arXiv:2310.12570v1 [eess.IV])
    Great progress has been made in automatic medical image segmentation due to powerful deep representation learning. The influence of the transformer has led to research into its variants and large-scale replacement of traditional CNN modules. However, this trend often overlooks the intrinsic feature extraction capabilities of the transformer and potential refinements to both the model and the transformer module through minor adjustments. This study proposes a novel deep medical image segmentation framework, called DA-TransUNet, aiming to introduce the Transformer and dual attention block into the encoder and decoder of the traditional U-shaped architecture. Unlike prior transformer-based solutions, our DA-TransUNet utilizes the attention mechanism of the transformer and the multifaceted feature extraction of the DA-Block, which can efficiently combine global, local, and multi-scale features to enhance medical image segmentation. Meanwhile, experimental results show that adding a dual attention block before the Transformer layer facilitates feature extraction in the U-net structure. Furthermore, incorporating dual attention blocks in skip connections can enhance feature transfer to the decoder, thereby improving image segmentation performance. Experimental results across various medical image segmentation benchmarks reveal that DA-TransUNet significantly outperforms the state-of-the-art methods. The codes and parameters of our model will be publicly available at https://github.com/SUN-1024/DA-TransUnet.
    Canonical normalizing flows for manifold learning. (arXiv:2310.12743v1 [stat.ML])
    Manifold learning flows are a class of generative modelling techniques that assume a low-dimensional manifold description of the data. The embedding of such a manifold into the high-dimensional space of the data is achieved via learnable invertible transformations. Therefore, once the manifold is properly aligned via a reconstruction loss, the probability density is tractable on the manifold and maximum likelihood can be used to optimize the network parameters. Naturally, the lower-dimensional representation of the data requires an injective mapping. Recent approaches were able to enforce that the density aligns with the modelled manifold, while efficiently calculating the density volume-change term when embedding to the higher-dimensional space. However, unless the injective mapping is analytically predefined, the learned manifold is not necessarily an efficient representation of the data. Namely, the latent dimensions of such models frequently learn an entangled intrinsic basis with degenerate information being stored in each dimension. Alternatively, if a locally orthogonal and/or sparse basis is to be learned, here coined canonical intrinsic basis, it can serve in learning a more compact latent space representation. Towards this end, we propose a canonical manifold learning flow method, where a novel optimization objective enforces the transformation matrix to have few prominent and orthogonal basis functions. Canonical manifold flow yields a more efficient use of the latent space, automatically generating fewer prominent and distinct dimensions to represent data, and consequently a better approximation of target distributions than other manifold flow methods in most experiments we conducted, resulting in lower FID scores.
    Transformer-based Entity Legal Form Classification. (arXiv:2310.12766v1 [cs.CL])
    We propose the application of Transformer-based language models for classifying entity legal forms from raw legal entity names. Specifically, we employ various BERT variants and compare their performance against multiple traditional baselines. Our evaluation encompasses a substantial subset of freely available Legal Entity Identifier (LEI) data, comprising over 1.1 million legal entities from 30 different legal jurisdictions. The ground truth labels for classification per jurisdiction are taken from the Entity Legal Form (ELF) code standard (ISO 20275). Our findings demonstrate that pre-trained BERT variants outperform traditional text classification approaches in terms of F1 score, while also performing comparably well in the Macro F1 Score. Moreover, the validity of our proposal is supported by the outcome of third-party expert reviews conducted in ten selected jurisdictions. This study highlights the significant potential of Transformer-based models in advancing data standardization and data integration. The presented approaches can greatly benefit financial institutions, corporations, governments and other organizations in assessing business relationships, understanding risk exposure, and promoting effective governance.
    Towards Enhanced Local Explainability of Random Forests: a Proximity-Based Approach. (arXiv:2310.12428v1 [stat.ML])
    We initiate a novel approach to explain the out of sample performance of random forest (RF) models by exploiting the fact that any RF can be formulated as an adaptive weighted K nearest-neighbors model. Specifically, we use the proximity between points in the feature space learned by the RF to re-write random forest predictions exactly as a weighted average of the target labels of training data points. This linearity facilitates a local notion of explainability of RF predictions that generates attributions for any model prediction across observations in the training set, and thereby complements established methods like SHAP, which instead generates attributions for a model prediction across dimensions of the feature space. We demonstrate this approach in the context of a bond pricing model trained on US corporate bond trades, and compare our approach to various existing approaches to model explainability.
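The rewriting of a forest prediction as a weighted average of training labels can be illustrated with a small sketch. It is not the authors' code. It assumes a regression forest in which every tree is grown on the full training set (no bootstrap) and predicts the mean label of the leaf a point falls into. The leaf assignments are passed in directly as hypothetical inputs (`train_leaves`, `query_leaves`) rather than produced by a real forest library.

```python
def proximity_weights(train_leaves, query_leaves):
    """Weight of each training point in the forest prediction for one query.

    train_leaves[t][i] is the leaf id of training point i in tree t.
    query_leaves[t] is the leaf id of the query point in tree t.
    Assumes every query leaf contains at least one training point.
    """
    n_trees = len(train_leaves)
    n_train = len(train_leaves[0])
    weights = [0.0] * n_train
    for leaves, query_leaf in zip(train_leaves, query_leaves):
        # Training points that share the query's leaf in this tree.
        members = [i for i, leaf in enumerate(leaves) if leaf == query_leaf]
        for i in members:
            # Each co-leaf member gets an equal share of this tree's vote.
            weights[i] += 1.0 / (len(members) * n_trees)
    return weights


def forest_prediction(train_leaves, query_leaves, y):
    """Usual forest prediction: the average over trees of the leaf mean."""
    per_tree = []
    for leaves, query_leaf in zip(train_leaves, query_leaves):
        values = [y[i] for i, leaf in enumerate(leaves) if leaf == query_leaf]
        per_tree.append(sum(values) / len(values))
    return sum(per_tree) / len(per_tree)


if __name__ == "__main__":
    train_leaves = [[0, 0, 1, 1], [0, 1, 1, 1]]  # two trees, four training points
    query_leaves = [0, 1]
    y = [1.0, 2.0, 3.0, 4.0]
    w = proximity_weights(train_leaves, query_leaves)
    print(sum(wi * yi for wi, yi in zip(w, y)))
    print(forest_prediction(train_leaves, query_leaves, y))
```

Under these assumptions the two printed values agree, and the weights sum to one, so each weight can be read as the attribution of the prediction to one training observation.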
    AI Potentiality and Awareness: A Position Paper from the Perspective of Human-AI Teaming in Cybersecurity. (arXiv:2310.12162v1 [cs.CR])
    This position paper explores the broad landscape of AI potentiality in the context of cybersecurity, with a particular emphasis on its possible risk factors with awareness, which can be managed by incorporating human experts in the loop, i.e., "Human-AI" teaming. As artificial intelligence (AI) technologies advance, they will provide unparalleled opportunities for attack identification, incident response, and recovery. However, the successful deployment of AI into cybersecurity measures necessitates an in-depth understanding of its capabilities, challenges, and ethical and legal implications to handle associated risk factors in real-world application areas. Towards this, we emphasize the importance of a balanced approach that incorporates AI's computational power with human expertise. AI systems may proactively discover vulnerabilities and detect anomalies through pattern recognition, and predictive modeling, significantly enhancing speed and accuracy. Human experts can explain AI-generated decisions to stakeholders, regulators, and end-users in critical situations, ensuring responsibility and accountability, which helps establish trust in AI-driven security solutions. Therefore, in this position paper, we argue that human-AI teaming is worthwhile in cybersecurity, in which human expertise such as intuition, critical thinking, or contextual understanding is combined with AI's computational power to improve overall cyber defenses.
    Preliminary studies: Comparing LSTM and BLSTM Deep Neural Networks for Power Consumption Prediction. (arXiv:2305.16546v2 [cs.LG] UPDATED)
    Electric consumption prediction methods are investigated for many reasons, such as decision-making related to energy efficiency as well as anticipating demand in the energy market dynamics. The objective of the present work is the comparison between two Deep Learning models, namely the Long Short-Term Memory (LSTM) and Bi-directional LSTM (BLSTM), for univariate electric consumption Time Series (TS) short-term forecast. The Data Sets (DSs) were selected for their different contexts and scales, aiming to assess the models' robustness. Four DSs were used, related to the power consumption of: (a) a household in France; (b) a university building in Santarém, Brazil; (c) the Tétouan city zones, in Morocco; and (d) the Singapore aggregated electric demand. The metrics RMSE, MAE, MAPE and R2 were calculated in a TS cross-validation scheme. Friedman's test was applied to normalized RMSE (NRMSE) results, showing that BLSTM outperforms LSTM with a statistically significant difference (p = 0.0455), corroborating the fact that bidirectional weight updating significantly improves the LSTM performance concerning different scales of electric power consumption.
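For readers unfamiliar with the four metrics named in the abstract, the sketch below computes them with their textbook definitions. The function name `forecast_metrics` and the sample numbers are illustrative assumptions, not taken from the paper. MAPE is undefined when an actual value is zero, so the sketch assumes strictly non-zero actuals.

```python
import math


def forecast_metrics(actual, predicted):
    """Return RMSE, MAE, MAPE (in percent) and R2 for two equal-length series."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    ss_res = sum(e * e for e in errors)  # residual sum of squares
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)  # total sum of squares
    return {
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
        # Assumes no actual value is zero.
        "mape": 100.0 * sum(abs(e / a) for e, a in zip(errors, actual)) / n,
        "r2": 1.0 - ss_res / ss_tot,
    }


if __name__ == "__main__":
    print(forecast_metrics([100.0, 200.0, 300.0, 400.0],
                           [110.0, 190.0, 310.0, 390.0]))
```

The normalization used for NRMSE is not specified in the abstract, so it is left out here.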
    Loop Copilot: Conducting AI Ensembles for Music Generation and Iterative Editing. (arXiv:2310.12404v1 [cs.SD])
    Creating music is iterative, requiring varied methods at each stage. However, existing AI music systems fall short in orchestrating multiple subsystems for diverse needs. To address this gap, we introduce Loop Copilot, a novel system that enables users to generate and iteratively refine music through an interactive, multi-round dialogue interface. The system uses a large language model to interpret user intentions and select appropriate AI models for task execution. Each backend model is specialized for a specific task, and their outputs are aggregated to meet the user's requirements. To ensure musical coherence, essential attributes are maintained in a centralized table. We evaluate the effectiveness of the proposed system through semi-structured interviews and questionnaires, highlighting its utility not only in facilitating music creation but also its potential for broader applications.
    Knowledge from Uncertainty in Evidential Deep Learning. (arXiv:2310.12663v1 [cs.LG])
    This work reveals an evidential signal that emerges from the uncertainty value in Evidential Deep Learning (EDL). EDL is one example of a class of uncertainty-aware deep learning approaches designed to provide confidence (or epistemic uncertainty) about the current test sample. In particular for computer vision and bidirectional encoder large language models, the 'evidential signal' arising from the Dirichlet strength in EDL can, in some cases, discriminate between classes, which is particularly strong when using large language models. We hypothesise that the KL regularisation term causes EDL to couple aleatoric and epistemic uncertainty. In this paper, we empirically investigate the correlations between misclassification and evaluated uncertainty, and show that EDL's 'evidential signal' is due to misclassification bias. We critically evaluate EDL with other Dirichlet-based approaches, namely Generative Evidential Neural Networks (EDL-GEN) and Prior Networks, and show theoretically and empirically the differences between these loss functions. We conclude that EDL's coupling of uncertainty arises from these differences due to the use (or lack) of out-of-distribution samples during training.
    Rethinking Complex Queries on Knowledge Graphs with Neural Link Predictors. (arXiv:2304.07063v3 [cs.AI] UPDATED)
    Reasoning on knowledge graphs is a challenging task because it utilizes observed information to predict the missing one. Particularly, answering complex queries based on first-order logic is one of the crucial tasks to verify learning to reason abilities for generalization and composition. Recently, the prevailing method is query embedding which learns the embedding of a set of entities and treats logic operations as set operations and has shown great empirical success. Though there has been much research following the same formulation, many of its claims lack a formal and systematic inspection. In this paper, we rethink this formulation and justify many of the previous claims by characterizing the scope of queries investigated previously and precisely identifying the gap between its formulation and its goal, as well as providing complexity analysis for the currently investigated queries. Moreover, we develop a new dataset containing ten new types of queries with features that have never been considered and therefore can provide a thorough investigation of complex queries. Finally, we propose a new neural-symbolic method, Fuzzy Inference with Truth value (FIT), where we equip the neural link predictors with fuzzy logic theory to support end-to-end learning using complex queries with provable reasoning capability. Empirical results show that our method outperforms previous methods significantly in the new dataset and also surpasses previous methods in the existing dataset at the same time.
    2D-3D Interlaced Transformer for Point Cloud Segmentation with Scene-Level Supervision. (arXiv:2310.12817v1 [cs.CV])
    We present a Multimodal Interlaced Transformer (MIT) that jointly considers 2D and 3D data for weakly supervised point cloud segmentation. Research studies have shown that 2D and 3D features are complementary for point cloud segmentation. However, existing methods require extra 2D annotations to achieve 2D-3D information fusion. Considering the high annotation cost of point clouds, effective 2D and 3D feature fusion based on weakly supervised learning is in great demand. To this end, we propose a transformer model with two encoders and one decoder for weakly supervised point cloud segmentation using only scene-level class tags. Specifically, the two encoders compute the self-attended features for 3D point clouds and 2D multi-view images, respectively. The decoder implements interlaced 2D-3D cross-attention and carries out implicit 2D and 3D feature fusion. We alternately switch the roles of queries and key-value pairs in the decoder layers. It turns out that the 2D and 3D features are iteratively enriched by each other. Experiments show that it performs favorably against existing weakly supervised point cloud segmentation methods by a large margin on the S3DIS and ScanNet benchmarks. The project page will be available at https://jimmy15923.github.io/mit_web/.
    Model Merging by Uncertainty-Based Gradient Matching. (arXiv:2310.12808v1 [cs.LG])
    Models trained on different datasets can be merged by a weighted-averaging of their parameters, but why does it work and when can it fail? Here, we connect the inaccuracy of weighted-averaging to mismatches in the gradients and propose a new uncertainty-based scheme to improve the performance by reducing the mismatch. The connection also reveals implicit assumptions in other schemes such as averaging, task arithmetic, and Fisher-weighted averaging. Our new method gives consistent improvements for large language models and vision transformers, both in terms of performance and robustness to hyperparameters.
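The baseline the abstract starts from, merging by weighted averaging of parameters, can be sketched as follows. This shows only plain weighted averaging, not the uncertainty-based scheme the paper proposes. The parameter layout (a dict of flat lists per model) and the name `merge_parameters` are assumptions made for illustration.

```python
def merge_parameters(models, weights):
    """Merge models by a weighted average of their parameters.

    models is a list of dicts mapping a parameter name to a flat list of floats.
    weights holds one non-negative weight per model and is normalized here.
    All models are assumed to share the same names and shapes.
    """
    total = sum(weights)
    merged = {}
    for name, reference in models[0].items():
        merged[name] = [
            sum(w * m[name][k] for w, m in zip(weights, models)) / total
            for k in range(len(reference))
        ]
    return merged


if __name__ == "__main__":
    model_a = {"layer.weight": [1.0, 2.0]}
    model_b = {"layer.weight": [3.0, 6.0]}
    print(merge_parameters([model_a, model_b], [1.0, 3.0]))
```

Uniform weights recover simple averaging, which is one of the schemes the abstract lists.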
    Knowledge-Augmented Language Model Verification. (arXiv:2310.12836v1 [cs.CL])
    Recent Language Models (LMs) have shown impressive capabilities in generating texts with the knowledge internalized in parameters. Yet, LMs often generate factually incorrect responses to the given queries, since their knowledge may be inaccurate, incomplete, and outdated. To address this problem, previous works propose to augment LMs with the knowledge retrieved from an external knowledge source. However, such approaches often show suboptimal text generation performance for two reasons: 1) the model may fail to retrieve the knowledge relevant to the given query, or 2) the model may not faithfully reflect the retrieved knowledge in the generated text. To overcome these, we propose to verify the output and the knowledge of the knowledge-augmented LMs with a separate verifier, which is a small LM that is trained to detect those two types of errors through instruction-finetuning. Then, when the verifier recognizes an error, we can rectify it by either retrieving new knowledge or generating new text. Further, we use an ensemble of the outputs from different instructions with a single verifier to enhance the reliability of the verification processes. We validate the effectiveness of the proposed verification steps on multiple question answering benchmarks, whose results show that the proposed verifier effectively identifies retrieval and generation errors, allowing LMs to provide more factually correct outputs. Our code is available at https://github.com/JinheonBaek/KALMV.
    AgentTuning: Enabling Generalized Agent Abilities for LLMs. (arXiv:2310.12823v1 [cs.CL])
    Open large language models (LLMs) with great performance in various tasks have significantly advanced the development of LLMs. However, they are far inferior to commercial models such as ChatGPT and GPT-4 when acting as agents to tackle complex tasks in the real world. These agent tasks employ LLMs as the central controller responsible for planning, memorization, and tool utilization, necessitating both fine-grained prompting methods and robust LLMs to achieve satisfactory performance. Though many prompting methods have been proposed to complete particular agent tasks, there is a lack of research focusing on improving the agent capabilities of LLMs themselves without compromising their general abilities. In this work, we present AgentTuning, a simple and general method to enhance the agent abilities of LLMs while maintaining their general LLM capabilities. We construct AgentInstruct, a lightweight instruction-tuning dataset containing high-quality interaction trajectories. We employ a hybrid instruction-tuning strategy by combining AgentInstruct with open-source instructions from general domains. AgentTuning is used to instruction-tune the Llama 2 series, resulting in AgentLM. Our evaluations show that AgentTuning enables LLMs' agent capabilities without compromising general abilities. The AgentLM-70B is comparable to GPT-3.5-turbo on unseen agent tasks, demonstrating generalized agent capabilities. We open source the AgentInstruct and AgentLM-7B, 13B, and 70B models at https://github.com/THUDM/AgentTuning , serving as open and powerful alternatives to commercial LLMs for agent tasks.
    Learn from the Past: A Proxy based Adversarial Defense Framework to Boost Robustness. (arXiv:2310.12713v1 [cs.LG])
    In light of the vulnerability of deep learning models to adversarial samples and the ensuing security issues, a range of methods, including Adversarial Training (AT) as a prominent representative, aimed at enhancing model robustness against various adversarial attacks, have seen rapid development. However, existing methods essentially assist the current state of the target model to defend against parameter-oriented adversarial attacks with explicit or implicit computation burdens, and they also suffer from unstable convergence behavior due to inconsistency of optimization trajectories. Diverging from previous work, this paper reconsiders the update rule of the target model and the corresponding deficiency of defending based on its current state. By introducing the historical state of the target model as a proxy, which is endowed with much prior information for defense, we formulate a two-stage update rule, resulting in a general adversarial defense framework, which we refer to as 'LAST' (Learn from the Past). Besides, we devise a Self Distillation (SD) based defense objective to constrain the update process of the proxy model without the introduction of larger teacher models. Experimentally, we demonstrate consistent and significant performance enhancements by refining a series of single-step and multi-step AT methods (e.g., up to 9.2% and 20.5% improvement of Robust Accuracy (RA) on CIFAR10 and CIFAR100 datasets, respectively) across various datasets, backbones and attack modalities, and validate its ability to enhance training stability and ameliorate catastrophic overfitting issues.
    OceanGPT: A Large Language Model for Ocean Science Tasks. (arXiv:2310.02031v3 [cs.CL] UPDATED)
    Ocean science, which delves into the oceans that are reservoirs of life and biodiversity, is of great significance given that oceans cover over 70% of our planet's surface. Recently, advances in Large Language Models (LLMs) have transformed the paradigm in science. Despite the success in other domains, current LLMs often fall short in catering to the needs of domain experts like oceanographers, and the potential of LLMs for ocean science is under-explored. The intrinsic reason may be the immense and intricate nature of ocean data as well as the necessity for higher granularity and richness in knowledge. To alleviate these issues, we introduce OceanGPT, the first-ever LLM in the ocean domain, which is expert in various ocean science tasks. We propose DoInstruct, a novel framework to automatically obtain a large volume of ocean domain instruction data, which generates instructions based on multi-agent collaboration. Additionally, we construct the first oceanography benchmark, OceanBench, to evaluate the capabilities of LLMs in the ocean domain. Through comprehensive experiments, OceanGPT not only shows a higher level of knowledge expertise for ocean science tasks but also gains preliminary embodied intelligence capabilities in ocean technology. Codes, data and checkpoints will soon be available at https://github.com/zjunlp/KnowLM.
    Causal Similarity-Based Hierarchical Bayesian Models. (arXiv:2310.12595v1 [cs.LG])
    The key challenge underlying machine learning is generalisation to new data. This work studies generalisation for datasets consisting of related tasks that may differ in causal mechanisms. For example, observational medical data for complex diseases suffers from heterogeneity in causal mechanisms of disease across patients, creating challenges for machine learning algorithms that need to generalise to new patients outside of the training dataset. Common approaches for learning supervised models with heterogeneous datasets include learning a global model for the entire dataset, learning local models for each task's data, or utilising hierarchical, meta-learning and multi-task learning approaches to learn how to generalise from data pooled across multiple tasks. In this paper we propose causal similarity-based hierarchical Bayesian models to improve generalisation to new tasks by learning how to pool data from training tasks with similar causal mechanisms. We apply this general modelling principle to Bayesian neural networks and compare a variety of methods for estimating causal task similarity (for both known and unknown causal models). We demonstrate the benefits of our approach and applicability to real world problems through a range of experiments on simulated and real data.
    Unmasking Transformers: A Theoretical Approach to Data Recovery via Attention Weights. (arXiv:2310.12462v1 [cs.LG])
    In the realm of deep learning, transformers have emerged as a dominant architecture, particularly in natural language processing tasks. However, with their widespread adoption, concerns regarding the security and privacy of the data processed by these models have arisen. In this paper, we address a pivotal question: Can the data fed into transformers be recovered using their attention weights and outputs? We introduce a theoretical framework to tackle this problem. Specifically, we present an algorithm that aims to recover the input data $X \in \mathbb{R}^{d \times n}$ from given attention weights $W = QK^\top \in \mathbb{R}^{d \times d}$ and output $B \in \mathbb{R}^{n \times n}$ by minimizing the loss function $L(X)$. This loss function captures the discrepancy between the expected output and the actual output of the transformer. Our findings have significant implications for the Localized Layer-wise Mechanism (LLM), suggesting potential vulnerabilities in the model's design from a security and privacy perspective. This work underscores the importance of understanding and safeguarding the internal workings of transformers to ensure the confidentiality of processed data.
    Conditional Density Estimations from Privacy-Protected Data. (arXiv:2310.12781v1 [stat.ML])
    Many modern statistical analysis and machine learning applications require training models on sensitive user data. Differential privacy provides a formal guarantee that individual-level information about users does not leak. In this framework, randomized algorithms inject calibrated noise into the confidential data, resulting in privacy-protected datasets or queries. However, restricting access to only the privatized data during statistical analysis makes it computationally challenging to perform valid inferences on parameters underlying the confidential data. In this work, we propose simulation-based inference methods from privacy-protected datasets. Specifically, we use neural conditional density estimators as a flexible family of distributions to approximate the posterior distribution of model parameters given the observed private query results. We illustrate our methods on discrete time-series data under an infectious disease model and on ordinary linear regression models. Illustrating the privacy-utility trade-off, our experiments and analysis demonstrate the necessity and feasibility of designing valid statistical inference procedures to correct for biases introduced by the privacy-protection mechanisms.
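As a concrete instance of "calibrated noise", the sketch below uses the standard Laplace mechanism, which adds Laplace noise with scale sensitivity/epsilon to a numeric query result. The abstract does not say which mechanism the paper uses, so this choice, the function name `laplace_mechanism` and the example query are all assumptions.

```python
import random


def laplace_mechanism(true_value, sensitivity, epsilon, rng):
    """Release true_value with epsilon-differential privacy via Laplace noise.

    sensitivity is the most the query can change when one record changes.
    A smaller epsilon means stronger privacy and therefore more noise.
    """
    scale = sensitivity / epsilon
    # The difference of two i.i.d. exponential draws with mean `scale`
    # is Laplace-distributed with location 0 and that scale.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return true_value + noise


if __name__ == "__main__":
    rng = random.Random(0)
    # A counting query has sensitivity 1: one person changes the count by at most 1.
    print(laplace_mechanism(true_value=100.0, sensitivity=1.0, epsilon=0.5, rng=rng))
```

An analyst who sees only such noisy releases has to account for the added noise when inferring the underlying parameters, which is the inference problem the abstract addresses.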
    Boosting Inference Efficiency: Unleashing the Power of Parameter-Shared Pre-trained Language Models. (arXiv:2310.12818v1 [cs.CL])
    Parameter-shared pre-trained language models (PLMs) have emerged as a successful approach in resource-constrained environments, enabling substantial reductions in model storage and memory costs without significant performance compromise. However, it is important to note that parameter sharing does not alleviate computational burdens associated with inference, thus impeding its practicality in situations characterized by stringent latency requirements or limited computational resources. Building upon neural ordinary differential equations (ODEs), we introduce a straightforward technique to enhance the inference efficiency of parameter-shared PLMs. Additionally, we propose a simple pre-training technique that leads to fully or partially shared models capable of achieving even greater inference acceleration. The experimental results demonstrate the effectiveness of our methods on both autoregressive and autoencoding PLMs, providing novel insights into more efficient utilization of parameter-shared models in resource-constrained settings.
    Fast Model Debias with Machine Unlearning. (arXiv:2310.12560v1 [cs.LG])
    Recent discoveries have revealed that deep neural networks might behave in a biased manner in many real-world scenarios. For instance, deep networks trained on a large-scale face recognition dataset CelebA tend to predict blonde hair for females and black hair for males. Such biases not only jeopardize the robustness of models but also perpetuate and amplify social biases, which is especially concerning for automated decision-making processes in healthcare, recruitment, etc., as they could exacerbate unfair economic and social inequalities among different groups. Existing debiasing methods suffer from high costs in bias labeling or model re-training, while also exhibiting a deficiency in terms of elucidating the origins of biases within the model. To this respect, we propose a fast model debiasing framework (FMD) which offers an efficient approach to identify, evaluate and remove biases inherent in trained models. The FMD identifies biased attributes through an explicit counterfactual concept and quantifies the influence of data samples with influence functions. Moreover, we design a machine unlearning-based strategy to efficiently and effectively remove the bias in a trained model with a small counterfactual dataset. Experiments on the Colored MNIST, CelebA, and Adult Income datasets along with experiments with large language models demonstrate that our method achieves superior or competing accuracies compared with state-of-the-art methods while attaining significantly fewer biases and requiring much less debiasing cost. Notably, our method requires only a small external dataset and updating a minimal amount of model parameters, without the requirement of access to training data that may be too large or unavailable in practice.
    An Improved Metarounding Algorithm via Frank-Wolfe. (arXiv:2310.12629v1 [cs.DS])
    Metarounding is an approach to convert an approximation algorithm for linear optimization over some combinatorial classes to an online linear optimization algorithm for the same class. We propose a new metarounding algorithm under a natural assumption that a relax-based approximation algorithm exists for the combinatorial class. Our algorithm is much more efficient in both theoretical and practical aspects.
    Label-Aware Automatic Verbalizer for Few-Shot Text Classification. (arXiv:2310.12778v1 [cs.CL])
    Prompt-based learning has shown its effectiveness in few-shot text classification. One important factor in its success is a verbalizer, which translates output from a language model into a predicted class. Notably, the simplest and widely acknowledged verbalizer employs manual labels to represent the classes. However, manual selection does not guarantee the optimality of the selected words when conditioned on the chosen language model. Therefore, we propose Label-Aware Automatic Verbalizer (LAAV), effectively augmenting the manual labels to achieve better few-shot classification results. Specifically, we use the manual labels along with the conjunction "and" to induce the model to generate more effective words for the verbalizer. The experimental results on five datasets across five languages demonstrate that LAAV significantly outperforms existing verbalizers. Furthermore, our analysis reveals that LAAV suggests more relevant words compared to similar approaches, especially in mid-to-low resource languages.
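The role of a verbalizer, mapping a language model's output over words to a predicted class, can be sketched as below. This is a generic illustration and not LAAV itself. The token probabilities are hard-coded stand-ins for what a masked language model would return, and averaging the probabilities of each class's label words is one common aggregation choice that is assumed here.

```python
def verbalize(token_probs, verbalizer):
    """Map token probabilities at the mask position to a class prediction.

    token_probs maps a vocabulary word to its probability (hypothetical values).
    verbalizer maps a class name to the list of label words representing it.
    """
    scores = {}
    for label, words in verbalizer.items():
        # Average the probability mass assigned to this class's label words.
        scores[label] = sum(token_probs.get(w, 0.0) for w in words) / len(words)
    predicted = max(scores, key=scores.get)
    return predicted, scores


if __name__ == "__main__":
    token_probs = {"great": 0.4, "good": 0.2, "bad": 0.1, "terrible": 0.05}
    verbalizer = {"positive": ["great", "good"], "negative": ["bad", "terrible"]}
    print(verbalize(token_probs, verbalizer))
```

A manual verbalizer would list a single hand-picked word per class. An automatic one, as described in the abstract, extends each list with words suggested by the model.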
    MTS-LOF: Medical Time-Series Representation Learning via Occlusion-Invariant Features. (arXiv:2310.12451v1 [cs.LG])
    Medical time series data are indispensable in healthcare, providing critical insights for disease diagnosis, treatment planning, and patient management. The exponential growth in data complexity, driven by advanced sensor technologies, has presented challenges related to data labeling. Self-supervised learning (SSL) has emerged as a transformative approach to address these challenges, eliminating the need for extensive human annotation. In this study, we introduce a novel framework for Medical Time Series Representation Learning, known as MTS-LOF. MTS-LOF leverages the strengths of contrastive learning and Masked Autoencoder (MAE) methods, offering a unique approach to representation learning for medical time series data. By combining these techniques, MTS-LOF enhances the potential of healthcare applications by providing more sophisticated, context-rich representations. Additionally, MTS-LOF employs a multi-masking strategy to facilitate occlusion-invariant feature learning. This approach allows the model to create multiple views of the data by masking portions of it. By minimizing the discrepancy between the representations of these masked patches and the fully visible patches, MTS-LOF learns to capture rich contextual information within medical time series datasets. The results of experiments conducted on diverse medical time series datasets demonstrate the superiority of MTS-LOF over other methods. These findings hold promise for significantly enhancing healthcare applications by improving representation learning. Furthermore, our work delves into the integration of joint-embedding SSL and MAE techniques, shedding light on the intricate interplay between temporal and structural dependencies in healthcare data. This understanding is crucial, as it allows us to grasp the complexities of healthcare data analysis.
    Improved Operator Learning by Orthogonal Attention. (arXiv:2310.12487v1 [cs.LG])
    Neural operators, as an efficient surrogate model for learning the solutions of PDEs, have received extensive attention in the field of scientific machine learning. Among them, attention-based neural operators have become one of the mainstreams in related research. However, existing approaches overfit the limited training data due to the considerable number of parameters in the attention mechanism. To address this, we develop an orthogonal attention based on the eigendecomposition of the kernel integral operator and the neural approximation of eigenfunctions. The orthogonalization naturally poses a proper regularization effect on the resulting neural operator, which aids in resisting overfitting and boosting generalization. Experiments on six standard neural operator benchmark datasets comprising both regular and irregular geometries show that our method can outperform competing baselines with decent margins.
    Attack Prompt Generation for Red Teaming and Defending Large Language Models. (arXiv:2310.12505v1 [cs.CL])
    Large language models (LLMs) are susceptible to red teaming attacks, which can induce LLMs to generate harmful content. Previous research constructs attack prompts via manual or automatic methods, which have their own limitations on construction cost and quality. To address these issues, we propose an integrated approach that combines manual and automatic methods to economically generate high-quality attack prompts. Specifically, considering the impressive capabilities of newly emerged LLMs, we propose an attack framework to instruct LLMs to mimic human-generated prompts through in-context learning. Furthermore, we propose a defense framework that fine-tunes victim LLMs through iterative interactions with the attack framework to enhance their safety against red teaming attacks. Extensive experiments on different LLMs validate the effectiveness of our proposed attack and defense frameworks. Additionally, we release a series of attack prompt datasets named SAP with varying sizes, facilitating the safety evaluation and enhancement of more LLMs. Our code and dataset are available at https://github.com/Aatrox103/SAP .
    Piecewise Deterministic Markov Processes for Bayesian Neural Networks. (arXiv:2302.08724v2 [stat.ML] UPDATED)
    Inference on modern Bayesian Neural Networks (BNNs) often relies on a variational inference treatment, imposing assumptions, violated in practice, about independence and the form of the posterior. Traditional MCMC approaches avoid these assumptions at the cost of increased computation, because they are incompatible with subsampling of the likelihood. New Piecewise Deterministic Markov Process (PDMP) samplers permit subsampling, but introduce model-specific inhomogeneous Poisson processes (IPPs) that are difficult to sample from. This work introduces a new generic and adaptive thinning scheme for sampling from these IPPs, and demonstrates how this approach can accelerate the application of PDMPs for inference in BNNs. Experimentation illustrates that inference with these methods is computationally feasible, can improve predictive accuracy and MCMC mixing performance, and can provide informative uncertainty measurements when compared against other approximate inference schemes.
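    For readers unfamiliar with thinning, the following is a minimal sketch of the classic (non-adaptive) thinning method for sampling an inhomogeneous Poisson process, not the adaptive scheme proposed in the paper. It assumes a known constant upper bound on the rate; the function names and the example rate 1 + sin(t) are hypothetical and chosen only for illustration.

```python
import math
import random


def sample_ipp_by_thinning(rate, rate_upper_bound, t_end, rng):
    """Sample event times on (0, t_end] from an inhomogeneous Poisson process.

    Classic thinning: propose events from a homogeneous process with rate
    `rate_upper_bound`, then keep each proposal at time t with probability
    rate(t) / rate_upper_bound. Requires rate(t) <= rate_upper_bound.
    """
    events = []
    t = 0.0
    while True:
        # Waiting time to the next proposal of the dominating process.
        t += rng.expovariate(rate_upper_bound)
        if t > t_end:
            break
        # Accept the proposal with probability rate(t) / rate_upper_bound.
        if rng.random() * rate_upper_bound < rate(t):
            events.append(t)
    return events


def example_rate(t):
    # Hypothetical rate function, bounded above by 2.0.
    return 1.0 + math.sin(t)


events = sample_ipp_by_thinning(example_rate, 2.0, 10.0, random.Random(0))
print(len(events), "events; first few:", [round(t, 3) for t in events[:3]])
```

    The tighter the upper bound tracks the true rate, the fewer proposals are rejected, which is the cost that an adaptive bound would aim to reduce.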
    Seeking the Truth Beyond the Data. An Unsupervised Machine Learning Approach. (arXiv:2207.06949v4 [stat.ML] UPDATED)
    Clustering is an unsupervised machine learning methodology in which unlabeled elements/objects are grouped together, with the aim of constructing well-established clusters whose elements are classified according to their similarity. The goal of this process is to provide a useful aid to researchers that helps them identify patterns in the data. In large databases, such patterns may not be easily detectable without the contribution of a clustering algorithm. This article provides an in-depth description of the most widely used clustering methodologies, accompanied by useful presentations concerning suitable parameter selection and initializations. At the same time, this article is not only a review highlighting the major elements of the examined clustering techniques; it also emphasizes the comparison of these algorithms' clustering efficiency on 3 datasets, revealing their weaknesses and capabilities in terms of accuracy and complexity when confronted with discrete and continuous observations. The results help us draw valuable conclusions about the appropriateness of the examined clustering techniques with respect to the dataset's size.
    OODRobustBench: benchmarking and analyzing adversarial robustness under distribution shift. (arXiv:2310.12793v1 [cs.LG])
    Existing works have made great progress in improving adversarial robustness, but typically test their method only on data from the same distribution as the training data, i.e., in-distribution (ID) testing. As a result, it is unclear how such robustness generalizes under input distribution shifts, i.e., out-of-distribution (OOD) testing. This is a concerning omission, as such distribution shifts are unavoidable when methods are deployed in the wild. To address this issue we propose a benchmark named OODRobustBench to comprehensively assess OOD adversarial robustness using 23 dataset-wise shifts (i.e., naturalistic shifts in input distribution) and 6 threat-wise shifts (i.e., unforeseen adversarial threat models). OODRobustBench is used to assess 706 robust models using 60.7K adversarial evaluations. This large-scale analysis shows that: 1) adversarial robustness suffers from a severe OOD generalization issue; 2) ID robustness correlates strongly and positively, in a linear way, with OOD robustness under many distribution shifts. The latter enables the prediction of OOD robustness from ID robustness. Based on this, we are able to predict the upper limit of OOD robustness for existing robust training schemes. The results suggest that achieving OOD robustness requires designing novel methods beyond the conventional ones. Last, we discover that extra data, data augmentation, advanced model architectures and particular regularization approaches can improve OOD robustness. Notably, the discovered training schemes, compared to the baseline, exhibit dramatically higher robustness under threat shift while keeping high ID robustness, demonstrating promising new solutions for robustness against both multi-attack and unforeseen attacks.
    Energy-Based Models For Speech Synthesis. (arXiv:2310.12765v1 [cs.SD])
    Recently there has been a lot of interest in non-autoregressive (non-AR) models for speech synthesis, such as FastSpeech 2 and diffusion models. Unlike AR models, these models do not have autoregressive dependencies among outputs, which makes inference efficient. This paper expands the range of available non-AR models with another member called energy-based models (EBMs). The paper describes how noise contrastive estimation, which relies on the comparison between positive and negative samples, can be used to train EBMs. It proposes a number of strategies for generating effective negative samples, including using high-performing AR models. It also describes how sampling from EBMs can be performed using Langevin Markov Chain Monte-Carlo (MCMC). The use of Langevin MCMC makes it possible to draw connections between EBMs and currently popular diffusion models. Experiments on the LJSpeech dataset show that the proposed approach offers improvements over Tacotron 2.
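    As a rough illustration of Langevin MCMC sampling from an energy-based model, the sketch below runs the standard unadjusted Langevin update x <- x - (step/2) * dE/dx + sqrt(step) * noise on a toy one-dimensional energy. The quadratic energy E(x) = x^2 / 2 (whose target is a standard normal), the step size, and all function names are assumptions made for this example; the paper's speech EBM would use a learned neural energy instead.

```python
import math
import random


def langevin_chain(grad_energy, x0, step_size, n_steps, rng):
    """Run one unadjusted Langevin chain and return its final state."""
    x = x0
    for _ in range(n_steps):
        noise = rng.gauss(0.0, 1.0)
        # Gradient step on the energy plus injected Gaussian noise.
        x = x - 0.5 * step_size * grad_energy(x) + math.sqrt(step_size) * noise
    return x


def grad_quadratic_energy(x):
    # Toy energy E(x) = 0.5 * x**2, so dE/dx = x (target: standard normal).
    return x


rng = random.Random(0)
# Start every chain far from the mode to show that the chains forget x0.
samples = [langevin_chain(grad_quadratic_energy, 5.0, 0.1, 200, rng)
           for _ in range(1000)]
mean = sum(samples) / len(samples)
variance = sum((s - mean) ** 2 for s in samples) / len(samples)
print(f"mean={mean:.3f} variance={variance:.3f}")
```

    Without a Metropolis correction the chain has a small step-size-dependent bias, so the sample variance sits slightly above 1 for this toy target.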
    Constructing Impactful Machine Learning Research for Astronomy: Best Practices for Researchers and Reviewers. (arXiv:2310.12528v1 [astro-ph.IM])
    Machine learning has rapidly become a tool of choice for the astronomical community. It is being applied across a wide range of wavelengths and problems, from the classification of transients to neural network emulators of cosmological simulations, and is shifting paradigms about how we generate and report scientific results. At the same time, this class of method comes with its own set of best practices, challenges, and drawbacks, which, at present, are often reported on incompletely in the astrophysical literature. With this paper, we aim to provide a primer to the astronomical community, including authors, reviewers, and editors, on how to implement machine learning models and report their results in a way that ensures the accuracy of the results, reproducibility of the findings, and usefulness of the method.
    Document-Level Language Models for Machine Translation. (arXiv:2310.12303v1 [cs.CL])
    Despite the known limitations, most machine translation systems today still operate on the sentence level. One reason for this is that most parallel training data is only sentence-level aligned, without document-level meta information available. In this work, we set out to build context-aware translation systems utilizing document-level monolingual data instead. This can be achieved by combining any existing sentence-level translation model with a document-level language model. We improve existing approaches by leveraging recent advancements in model combination. Additionally, we propose novel weighting techniques that make the system combination more flexible and significantly reduce computational overhead. In a comprehensive evaluation on four diverse translation tasks, we show that our extensions improve document-targeted scores substantially and are also computationally more efficient. However, we also find that in most scenarios, back-translation gives even better results, at the cost of having to re-train the translation system. Finally, we explore language model fusion in the light of recent advancements in large language models. Our findings suggest that there might be strong potential in utilizing large language models via model combination.  ( 2 min )
    Improving SCGAN's Similarity Constraint and Learning a Better Disentangled Representation. (arXiv:2310.12262v1 [cs.CV])
    SCGAN adds a similarity constraint between generated images and conditions as a regularization term on generative adversarial networks. Similarity constraint works as a tutor to instruct the generator network to comprehend the difference of representations based on conditions. We understand how SCGAN works on a deeper level. This understanding makes us realize that the similarity constraint functions like the contrastive loss function. We believe that a model with high understanding and intelligence measures the similarity between images based on their structure and high level features, just like humans do. Two major changes we applied to SCGAN in order to make a modified model are using SSIM to measure similarity between images and applying contrastive loss principles to the similarity constraint. The modified model performs better using FID and FactorVAE metrics. The modified model also has better generalisability compared to other models. Keywords: Generative Adversarial Nets, Unsupervised Learning, Disentangled Representation Learning, Contrastive Disentanglement, SSIM  ( 2 min )
    American Option Pricing using Self-Attention GRU and Shapley Value Interpretation. (arXiv:2310.12500v1 [q-fin.PR])
    Options, serving as a crucial financial instrument, are used by investors to manage and mitigate their investment risks within the securities market. Precisely predicting the present price of an option enables investors to make informed and efficient decisions. In this paper, we propose a machine learning method for forecasting the prices of SPY (ETF) options based on a gated recurrent unit (GRU) and a self-attention mechanism. We first partitioned the raw dataset into 15 subsets according to moneyness and days to maturity criteria. For each subset, we matched the corresponding U.S. government bond rates and Implied Volatility Indices. This segmentation allows for a more insightful exploration of the impacts of risk-free rates and underlying volatility on option pricing. Next, we built four different machine learning models, including multilayer perceptron (MLP), long short-term memory (LSTM), self-attention LSTM, and self-attention GRU, for comparison with the traditional binomial model. The empirical result shows that self-attention GRU with historical data outperforms other models due to its ability to capture complex temporal dependencies and leverage the contextual information embedded in the historical data. Finally, in order to unveil the "black box" of artificial intelligence, we employed the SHapley Additive exPlanations (SHAP) method to interpret and analyze the prediction results of the self-attention GRU model with historical data. This provides insights into the significance and contributions of different input features on the pricing of American-style options.  ( 2 min )
    Opportunities for Adaptive Experiments to Enable Continuous Improvement that Trades-off Instructor and Researcher Incentives. (arXiv:2310.12324v1 [cs.HC])
    Randomized experimental comparisons of alternative pedagogical strategies could provide useful empirical evidence in instructors' decision-making. However, traditional experiments do not have a clear and simple pathway to using data rapidly to try to increase the chances that students in an experiment get the best conditions. Drawing inspiration from the use of machine learning and experimentation in product development at leading technology companies, we explore how adaptive experimentation might help in continuous course improvement. In adaptive experiments, as different arms/conditions are deployed to students, data is analyzed and used to change the experience for future students. This can be done using machine learning algorithms to identify which actions are more promising for improving student experience or outcomes. This algorithm can then dynamically deploy the most effective conditions to future students, resulting in better support for students' needs. We illustrate the approach with a case study providing a side-by-side comparison of traditional and adaptive experimentation of self-explanation prompts in online homework problems in a CS1 course. This provides a first step in exploring the future of how this methodology can be useful in bridging research and practice in doing continuous improvement.  ( 2 min )
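    To make the idea of an adaptive experiment concrete, here is a minimal sketch using Thompson sampling with Beta-Bernoulli arms. The abstract does not say which algorithm the authors use, so Thompson sampling is an assumption here, as are the simulated success rates and all names in the code.

```python
import random


def run_adaptive_experiment(true_success_rates, n_students, rng):
    """Assign simulated students to conditions with Thompson sampling.

    Each arm keeps a Beta(successes + 1, failures + 1) posterior over its
    success rate. For each student we draw one value per arm from its
    posterior and assign the arm with the largest draw.
    """
    n_arms = len(true_success_rates)
    successes = [0] * n_arms
    failures = [0] * n_arms
    assignments = [0] * n_arms
    for _ in range(n_students):
        draws = [rng.betavariate(successes[a] + 1, failures[a] + 1)
                 for a in range(n_arms)]
        arm = max(range(n_arms), key=lambda a: draws[a])
        assignments[arm] += 1
        # Simulated outcome; a real study would observe the student instead.
        if rng.random() < true_success_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return assignments


# Two hypothetical conditions with simulated success rates of 30% and 70%.
assignments = run_adaptive_experiment([0.3, 0.7], 1000, random.Random(0))
print("students per condition:", assignments)
```

    In contrast to a traditional experiment with a fixed 50/50 split, the allocation shifts toward the better-performing condition as evidence accumulates.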
    WeaveNet for Approximating Two-sided Matching Problems. (arXiv:2310.12515v1 [cs.LG])
    Matching, a task to optimally assign limited resources under constraints, is a fundamental technology for society. The task potentially has various objectives, conditions, and constraints; however, an efficient neural network architecture for matching is underexplored. This paper proposes a novel graph neural network (GNN), WeaveNet, designed for bipartite graphs. Since a bipartite graph is generally dense, general GNN architectures lose node-wise information by over-smoothing when deeply stacked. Such a phenomenon is undesirable for solving matching problems. WeaveNet avoids it by preserving edge-wise information while passing messages densely to reach a better solution. To evaluate the model, we approximated one of the strongly NP-hard problems, fair stable matching. Despite its inherent difficulties and the network's general purpose design, our model reached performance comparable to state-of-the-art algorithms specially designed for stable matching for small numbers of agents.  ( 2 min )
    Enhanced Graph Neural Networks with Ego-Centric Spectral Subgraph Embeddings Augmentation. (arXiv:2310.12169v1 [cs.SI])
    Graph Neural Networks (GNNs) have shown remarkable merit in performing various learning-based tasks in complex networks. The superior performance of GNNs often correlates with the availability and quality of node-level features in the input networks. However, for many network applications, such node-level information may be missing or unreliable, thereby limiting the applicability and efficacy of GNNs. To address this limitation, we present a novel approach denoted as Ego-centric Spectral subGraph Embedding Augmentation (ESGEA), which aims to enhance and design node features, particularly in scenarios where information is lacking. Our method leverages the topological structure of the local subgraph to create topology-aware node features. The subgraph features are generated using an efficient spectral graph embedding technique, and they serve as node features that capture the local topological organization of the network. The explicit node features, if present, are then enhanced with the subgraph embeddings in order to improve the overall performance. ESGEA is compatible with any GNN-based architecture and is effective even in the absence of node features. We evaluate the proposed method in a social network graph classification task where node attributes are unavailable, as well as in a node classification task where node features are corrupted or even absent. The evaluation results on seven datasets and eight baseline models indicate up to a 10% improvement in AUC and a 7% improvement in accuracy for graph and node classification tasks, respectively.  ( 3 min )
    Equipping Federated Graph Neural Networks with Structure-aware Group Fairness. (arXiv:2310.12350v1 [cs.LG])
    Graph Neural Networks (GNNs) have been widely used for various types of graph data processing and analytical tasks in different domains. Training GNNs over centralized graph data can be infeasible due to privacy concerns and regulatory restrictions. Thus, federated learning (FL) becomes a trending solution to address this challenge in a distributed learning paradigm. However, as GNNs may inherit historical bias from training data and lead to discriminatory predictions, the bias of local models can be easily propagated to the global model in distributed settings. This poses a new challenge in mitigating bias in federated GNNs. To address this challenge, we propose $\text{F}^2$GNN, a Fair Federated Graph Neural Network, that enhances group fairness of federated GNNs. As bias can be sourced from both data and learning algorithms, $\text{F}^2$GNN aims to mitigate both types of bias under federated settings. First, we provide theoretical insights on the connection between data bias in a training graph and statistical fairness metrics of the trained GNN models. Based on the theoretical analysis, we design $\text{F}^2$GNN which contains two key components: a fairness-aware local model update scheme that enhances group fairness of the local models on the client side, and a fairness-weighted global model update scheme that takes both data bias and fairness metrics of local models into consideration in the aggregation process. We evaluate $\text{F}^2$GNN empirically versus a number of baseline methods, and demonstrate that $\text{F}^2$GNN outperforms these baselines in terms of both fairness and model accuracy.  ( 3 min )
    Few-Shot In-Context Imitation Learning via Implicit Graph Alignment. (arXiv:2310.12238v1 [cs.RO])
    Consider the following problem: given a few demonstrations of a task across a few different objects, how can a robot learn to perform that same task on new, previously unseen objects? This is challenging because the large variety of objects within a class makes it difficult to infer the task-relevant relationship between the new objects and the objects in the demonstrations. We address this by formulating imitation learning as a conditional alignment problem between graph representations of objects. Consequently, we show that this conditioning allows for in-context learning, where a robot can perform a task on a set of new objects immediately after the demonstrations, without any prior knowledge about the object class or any further training. In our experiments, we explore and validate our design choices, and we show that our method is highly effective for few-shot learning of several real-world, everyday tasks, whilst outperforming baselines. Videos are available on our project webpage at https://www.robot-learning.uk/implicit-graph-alignment.  ( 2 min )
    ClusT3: Information Invariant Test-Time Training. (arXiv:2310.12345v1 [cs.CV])
    Deep Learning models have shown remarkable performance in a broad range of vision tasks. However, they are often vulnerable to domain shifts at test-time. Test-time training (TTT) methods have been developed in an attempt to mitigate these vulnerabilities, where a secondary task is solved at training time simultaneously with the main task, to be later used as a self-supervised proxy task at test-time. In this work, we propose a novel unsupervised TTT technique based on the maximization of Mutual Information between multi-scale feature maps and a discrete latent representation, which can be integrated into the standard training as an auxiliary clustering task. Experimental results demonstrate competitive classification performance on different popular test-time adaptation benchmarks.  ( 2 min )
    Learning to Solve Climate Sensor Placement Problems with a Transformer. (arXiv:2310.12387v1 [cs.LG])
    The optimal placement of sensors for environmental monitoring and disaster management is a challenging problem due to its NP-hard nature. Traditional methods for sensor placement involve exact, approximation, or heuristic approaches, with the latter being the most widely used. However, heuristic methods are limited by expert intuition and experience. Deep learning (DL) has emerged as a promising approach for generating heuristic algorithms automatically. In this paper, we introduce a novel sensor placement approach focused on learning improvement heuristics using deep reinforcement learning (RL) methods. Our approach leverages an RL formulation for learning improvement heuristics, driven by an actor-critic algorithm for training the policy network. We compare our method with several state-of-the-art approaches by conducting comprehensive experiments, demonstrating the effectiveness and superiority of our proposed approach in producing high-quality solutions. Our work presents a promising direction for applying advanced DL and RL techniques to challenging climate sensor placement problems.  ( 2 min )
    Enhancing the Performance of Automated Grade Prediction in MOOC using Graph Representation Learning. (arXiv:2310.12281v1 [cs.LG])
    In recent years, Massive Open Online Courses (MOOCs) have gained significant traction as a rapidly growing phenomenon in online learning. Unlike traditional classrooms, MOOCs offer a unique opportunity to cater to a diverse audience from different backgrounds and geographical locations. Renowned universities and MOOC-specific providers, such as Coursera, offer MOOC courses on various subjects. Automated assessment tasks like grade and early dropout predictions are necessary due to the high enrollment and limited direct interaction between teachers and learners. However, current automated assessment approaches overlook the structural links between different entities involved in the downstream tasks, such as the students and courses. We hypothesize that these structural relationships, manifested through an interaction graph, contain valuable information that can enhance the performance of the task at hand. To validate this, we construct a unique knowledge graph for a large MOOC dataset, which will be publicly available to the research community. Furthermore, we utilize graph embedding techniques to extract latent structural information encoded in the interactions between entities in the dataset. These techniques do not require ground truth labels and can be utilized for various tasks. Finally, by combining entity-specific features, behavioral features, and extracted structural features, we enhance the performance of predictive machine learning models in student assignment grade prediction. Our experiments demonstrate that structural features can significantly improve the predictive performance of downstream assessment tasks. The code and data are available at https://github.com/DSAatUSU/MOOPer_grade_prediction  ( 3 min )
    MARVEL: Multi-Agent Reinforcement-Learning for Large-Scale Variable Speed Limits. (arXiv:2310.12359v1 [cs.MA])
    Variable speed limit (VSL) control is a promising traffic management strategy for enhancing safety and mobility. This work introduces MARVEL, a multi-agent reinforcement learning (MARL) framework for implementing large-scale VSL control on freeway corridors using only commonly available data. The agents learn through a reward structure that incorporates adaptability to traffic conditions, safety, and mobility, enabling coordination among the agents. The proposed framework scales to cover corridors with many gantries thanks to parameter sharing among all VSL agents. The agents are trained in a microsimulation environment based on a short freeway stretch with 8 gantries spanning 7 miles and tested with 34 gantries spanning 17 miles of I-24 near Nashville, TN. MARVEL improves traffic safety by 63.4% compared to the no control scenario and enhances traffic mobility by 14.6% compared to a state-of-the-practice algorithm that has been deployed on I-24. An explainability analysis is undertaken to explore the learned policy under different traffic conditions and the results provide insights into the decision-making process of agents. Finally, we test the policy learned from the simulation-based experiments on real input data from I-24 to illustrate the potential deployment capability of the learned policy.  ( 2 min )
    Architectural Implications of GNN Aggregation Programming Abstractions. (arXiv:2310.12184v1 [cs.LG])
    Graph neural networks (GNNs) have gained significant popularity due to their powerful capability to extract useful representations from graph data. As the need for efficient GNN computation intensifies, a variety of programming abstractions designed for optimizing GNN Aggregation have emerged to facilitate acceleration. However, there is no comprehensive evaluation and analysis of existing abstractions, and thus no clear consensus on which approach is better. In this letter, we classify existing programming abstractions for GNN Aggregation by the dimension of data organization and propagation method. By constructing these abstractions on a state-of-the-art GNN library, we perform a thorough and detailed characterization study to compare their performance and efficiency, and provide several insights on future GNN acceleration based on our analysis.  ( 2 min )
    Balanced Group Convolution: An Improved Group Convolution Based on Approximability Estimates. (arXiv:2310.12461v1 [cs.LG])
    The performance of neural networks has been significantly improved by increasing the number of channels in convolutional layers. However, this increase in performance comes with a higher computational cost, resulting in numerous studies focused on reducing it. One promising approach to address this issue is group convolution, which effectively reduces the computational cost by grouping channels. However, to the best of our knowledge, there has been no theoretical analysis on how well the group convolution approximates the standard convolution. In this paper, we mathematically analyze the approximation of the group convolution to the standard convolution with respect to the number of groups. Furthermore, we propose a novel variant of the group convolution called balanced group convolution, which achieves a better approximation with a small additional computational cost. We provide experimental results that validate our theoretical findings and demonstrate the superior performance of the balanced group convolution over other variants of group convolution.  ( 2 min )
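    The cost saving from grouping channels can be seen with a short weight count. The sketch below covers standard group convolution only, not the balanced variant proposed in the paper; the helper name and the 64-to-128-channel, 3x3 example layer are illustrative assumptions.

```python
def conv_weight_count(in_channels, out_channels, kernel_size, groups=1):
    """Number of weights (excluding bias) in a 2D convolution layer.

    With `groups` groups, each output channel only sees
    in_channels / groups input channels, so the weight count and the
    multiply-accumulate count both shrink by a factor of `groups`.
    """
    if in_channels % groups or out_channels % groups:
        raise ValueError("channel counts must be divisible by groups")
    return (in_channels // groups) * out_channels * kernel_size * kernel_size


standard = conv_weight_count(64, 128, 3)           # 64 * 128 * 9
grouped = conv_weight_count(64, 128, 3, groups=4)  # 16 * 128 * 9
print(standard, grouped, standard // grouped)
```

    The price of this reduction is that channels in different groups no longer interact within the layer, which is the approximation gap the abstract refers to.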
    RK-core: An Established Methodology for Exploring the Hierarchical Structure within Datasets. (arXiv:2310.12168v1 [cs.LG])
    Recently, the field of machine learning has undergone a transition from model-centric to data-centric. The advancements in diverse learning tasks have been propelled by the accumulation of more extensive datasets, subsequently facilitating the training of larger models on these datasets. However, these datasets remain relatively under-explored. To this end, we introduce a pioneering approach known as RK-core to gain a deeper understanding of the intricate hierarchical structure within datasets. Across several benchmark datasets, we find that samples with low coreness values appear less representative of their respective categories, and conversely, those with high coreness values exhibit greater representativeness. Correspondingly, samples with high coreness values make a more substantial contribution to the performance in comparison to those with low coreness values. Building upon this, we further employ RK-core to analyze the hierarchical structure of samples with different coreset selection methods. Remarkably, we find that a high-quality coreset should exhibit hierarchical diversity instead of solely opting for representative samples. The code is available at https://github.com/yaolu-zjut/Kcore.  ( 2 min )
    Open-Set Multivariate Time-Series Anomaly Detection. (arXiv:2310.12294v1 [cs.LG])
    Numerous time series anomaly detection (TSAD) methods have emerged in recent years. Most existing methods are unsupervised and assume the availability of normal training samples only, while a few supervised methods have shown superior performance by incorporating labeled anomalous samples in the training phase. However, certain anomaly types are inherently challenging for unsupervised methods to differentiate from normal data, while supervised methods are constrained to detecting anomalies resembling those present during training, failing to generalize to unseen anomaly classes. This paper is the first attempt at providing a novel approach for the open-set TSAD problem, in which a small number of labeled anomalies from a limited class of anomalies are visible in the training phase, with the objective of detecting both seen and unseen anomaly classes in the test phase. The proposed method, called Multivariate Open-Set timeseries Anomaly Detection (MOSAD), consists of three primary modules: a Feature Extractor to extract meaningful time-series features; a Multi-head Network consisting of Generative-, Deviation-, and Contrastive heads for capturing both seen and unseen anomaly classes; and an Anomaly Scoring module leveraging the insights of the three heads to detect anomalies. Extensive experiments on three real-world datasets consistently show that our approach surpasses existing methods under various experimental settings, thus establishing a new state-of-the-art performance in the TSAD field.  ( 2 min )
    Preference Optimization for Molecular Language Models. (arXiv:2310.12304v1 [stat.ML])
    Molecular language modeling is an effective approach to generating novel chemical structures. However, these models do not a priori encode certain preferences a chemist may desire. We investigate the use of fine-tuning using Direct Preference Optimization to better align generated molecules with chemist preferences. Our findings suggest that this approach is simple, efficient, and highly effective.  ( 2 min )
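    For context, the per-pair Direct Preference Optimization loss can be sketched as below. It takes the sequence log-probabilities of a preferred and a dispreferred output under the policy being tuned and under a frozen reference model. The numeric log-probabilities and the default beta are made-up illustration values, and how the paper builds preference pairs over molecules is not specified in the abstract.

```python
import math


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair: -log(sigmoid(beta * margin)).

    The margin is the policy-vs-reference log-ratio of the preferred output
    minus the same log-ratio of the dispreferred output.
    """
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    z = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(z)) equals softplus(-z); this form avoids overflow.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# Policy identical to the reference: margin is 0 and the loss is log(2).
loss_neutral = dpo_loss(-12.0, -12.0, -12.0, -12.0)
# Policy raises the preferred output and lowers the dispreferred one.
loss_aligned = dpo_loss(-10.0, -14.0, -12.0, -12.0)
print(f"neutral={loss_neutral:.4f} aligned={loss_aligned:.4f}")
```

    Minimizing this loss needs no separate reward model or sampling loop, which is consistent with the abstract's description of the approach as simple and efficient.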
    SalUn: Empowering Machine Unlearning via Gradient-based Weight Saliency in Both Image Classification and Generation. (arXiv:2310.12508v1 [cs.LG])
    With evolving data regulations, machine unlearning (MU) has become an important tool for fostering trust and safety in today's AI models. However, existing MU methods focusing on data and/or weight perspectives often grapple with limitations in unlearning accuracy, stability, and cross-domain applicability. To address these challenges, we introduce the concept of 'weight saliency' in MU, drawing parallels with input saliency in model explanation. This innovation directs MU's attention toward specific model weights rather than the entire model, improving effectiveness and efficiency. The resultant method that we call saliency unlearning (SalUn) narrows the performance gap with 'exact' unlearning (model retraining from scratch after removing the forgetting dataset). To the best of our knowledge, SalUn is the first principled MU approach adaptable enough to effectively erase the influence of forgetting data, classes, or concepts in both image classification and generation. For example, SalUn yields a stability advantage in high-variance random data forgetting, e.g., with a 0.2% gap compared to exact unlearning on the CIFAR-10 dataset. Moreover, in preventing conditional diffusion models from generating harmful images, SalUn achieves nearly 100% unlearning accuracy, outperforming current state-of-the-art baselines like Erased Stable Diffusion and Forget-Me-Not.  ( 2 min )
    SDGym: Low-Code Reinforcement Learning Environments using System Dynamics Models. (arXiv:2310.12494v1 [cs.LG])
    Understanding the long-term impact of algorithmic interventions on society is vital to achieving responsible AI. Traditional evaluation strategies often fall short due to the complex, adaptive and dynamic nature of society. While reinforcement learning (RL) can be a powerful approach for optimizing decisions in dynamic settings, the difficulty of realistic environment design remains a barrier to building robust agents that perform well in practical settings. To address this issue we tap into the field of system dynamics (SD) as a complementary method that incorporates collaborative simulation model specification practices. We introduce SDGym, a low-code library built on the OpenAI Gym framework which enables the generation of custom RL environments based on SD simulation models. Through a feasibility study we validate that well specified, rich RL environments can be generated from preexisting SD models and a few lines of configuration code. We demonstrate the capabilities of the SDGym environment using an SD model of the electric vehicle adoption problem. We compare two SD simulators, PySD and BPTK-Py for parity, and train a D4PG agent using the Acme framework to showcase learning and environment interaction. Our preliminary findings underscore the dual potential of SD to improve RL environment design and for RL to improve dynamic policy discovery within SD models. By open-sourcing SDGym, the intent is to galvanize further research and promote adoption across the SD and RL communities, thereby catalyzing collaboration in this emerging interdisciplinary space.  ( 2 min )
    Jorge: Approximate Preconditioning for GPU-efficient Second-order Optimization. (arXiv:2310.12298v1 [cs.LG])
    Despite their better convergence properties compared to first-order optimizers, second-order optimizers for deep learning have been less popular due to their significant computational costs. The primary efficiency bottleneck in such optimizers is matrix inverse calculations in the preconditioning step, which are expensive to compute on GPUs. In this paper, we introduce Jorge, a second-order optimizer that promises the best of both worlds -- rapid convergence benefits of second-order methods, and high computational efficiency typical of first-order methods. We address the primary computational bottleneck of computing matrix inverses by completely eliminating them using an approximation of the preconditioner computation. This makes Jorge extremely efficient on GPUs in terms of wall-clock time. Further, we describe an approach to determine Jorge's hyperparameters directly from a well-tuned SGD baseline, thereby significantly minimizing tuning efforts. Our empirical evaluations demonstrate the distinct advantages of using Jorge, outperforming state-of-the-art optimizers such as SGD, AdamW, and Shampoo across multiple deep learning models, both in terms of sample efficiency and wall-clock time.  ( 2 min )
    CAT: Closed-loop Adversarial Training for Safe End-to-End Driving. (arXiv:2310.12432v1 [cs.LG])
    Driving safety is a top priority for autonomous vehicles. Orthogonal to prior work handling accident-prone traffic events by algorithm designs at the policy level, we investigate a Closed-loop Adversarial Training (CAT) framework for safe end-to-end driving in this paper through the lens of environment augmentation. CAT aims to continuously improve the safety of driving agents by training the agent on safety-critical scenarios that are dynamically generated over time. A novel resampling technique is developed to turn log-replay real-world driving scenarios into safety-critical ones via probabilistic factorization, where the adversarial traffic generation is modeled as the multiplication of standard motion prediction sub-problems. Consequently, CAT can launch more efficient physical attacks compared to existing safety-critical scenario generation methods and incurs significantly less computational cost in the iterative learning pipeline. We incorporate CAT into the MetaDrive simulator and validate our approach on hundreds of driving scenarios imported from real-world driving datasets. Experimental results demonstrate that CAT can effectively generate adversarial scenarios countering the agent being trained. After training, the agent can achieve superior driving safety in both log-replay and safety-critical traffic scenarios on the held-out test set. Code and data are available at https://metadriverse.github.io/cat.  ( 2 min )
    Symmetric Neural-Collapse Representations with Supervised Contrastive Loss: The Impact of ReLU and Batching. (arXiv:2306.07960v2 [cs.LG] UPDATED)
    Supervised contrastive loss (SCL) is a competitive and often superior alternative to the cross-entropy loss for classification. While prior studies have demonstrated that both losses yield symmetric training representations under balanced data, this symmetry breaks under class imbalances. This paper presents an intriguing discovery: the introduction of a ReLU activation at the final layer effectively restores the symmetry in SCL-learned representations. We arrive at this finding analytically, by establishing that the global minimizers of an unconstrained features model with SCL loss and entry-wise non-negativity constraints form an orthogonal frame. Extensive experiments conducted across various datasets, architectures, and imbalance scenarios corroborate our finding. Importantly, our experiments reveal that the inclusion of the ReLU activation restores symmetry without compromising test accuracy. This constitutes the first geometry characterization of SCL under imbalances. Additionally, our analysis and experiments underscore the pivotal role of batch selection strategies in representation geometry. By proving necessary and sufficient conditions for mini-batch choices that ensure invariant symmetric representations, we introduce batch-binding as an efficient strategy that guarantees these conditions hold.  ( 2 min )
    A Computational Framework for Solving Wasserstein Lagrangian Flows. (arXiv:2310.10649v2 [cs.LG] CROSS LISTED)
    The dynamical formulation of the optimal transport can be extended through various choices of the underlying geometry ($\textit{kinetic energy}$), and the regularization of density paths ($\textit{potential energy}$). These combinations yield different variational problems ($\textit{Lagrangians}$), encompassing many variations of the optimal transport problem such as the Schr\"odinger bridge, unbalanced optimal transport, and optimal transport with physical constraints, among others. In general, the optimal density path is unknown, and solving these variational problems can be computationally challenging. Leveraging the dual formulation of the Lagrangians, we propose a novel deep learning based framework approaching all of these problems from a unified perspective. Our method does not require simulating or backpropagating through the trajectories of the learned dynamics, and does not need access to optimal couplings. We showcase the versatility of the proposed framework by outperforming previous approaches for the single-cell trajectory inference, where incorporating prior knowledge into the dynamics is crucial for correct predictions.  ( 2 min )
    Seeking the Truth Beyond the Data. An Unsupervised Machine Learning Approach. (arXiv:2207.06949v4 [stat.ML] UPDATED)
    Clustering is an unsupervised machine learning methodology in which unlabeled elements/objects are grouped together, with the aim of constructing well-established clusters whose elements are classified according to their similarity. The goal of this process is to provide a useful aid to the researcher that will help her/him to identify patterns among the data. When dealing with large databases, such patterns may not be easily detectable without the contribution of a clustering algorithm. This article provides a deep description of the most widely used clustering methodologies accompanied by useful presentations concerning suitable parameter selection and initializations. Simultaneously, this article not only represents a review highlighting the major elements of examined clustering techniques but also emphasizes the comparison of these algorithms' clustering efficiency based on 3 datasets, revealing their existing weaknesses and capabilities through accuracy and complexity, during the confrontation of discrete and continuous observations. The produced results help us extract valuable conclusions about the appropriateness of the examined clustering techniques in accordance with the dataset's size.  ( 3 min )
    Communication-Efficient On-Device Machine Learning: Federated Distillation and Augmentation under Non-IID Private Data. (arXiv:1811.11479v2 [cs.LG] UPDATED)
    On-device machine learning (ML) enables the training process to exploit a massive amount of user-generated private data samples. To enjoy this benefit, inter-device communication overhead should be minimized. To this end, we propose federated distillation (FD), a distributed model training algorithm whose communication payload size is much smaller than a benchmark scheme, federated learning (FL), particularly when the model size is large. Moreover, user-generated data samples are likely to become non-IID across devices, which commonly degrades the performance compared to the case with an IID dataset. To cope with this, we propose federated augmentation (FAug), where each device collectively trains a generative model, and thereby augments its local data towards yielding an IID dataset. Empirical studies demonstrate that FD with FAug yields around 26x less communication overhead while achieving 95-98% test accuracy compared to FL.  ( 2 min )
    Optimality Guarantees for Particle Belief Approximation of POMDPs. (arXiv:2210.05015v5 [cs.AI] UPDATED)
    Partially observable Markov decision processes (POMDPs) provide a flexible representation for real-world decision and control problems. However, POMDPs are notoriously difficult to solve, especially when the state and observation spaces are continuous or hybrid, which is often the case for physical systems. While recent online sampling-based POMDP algorithms that plan with observation likelihood weighting have shown practical effectiveness, a general theory characterizing the approximation error of the particle filtering techniques that these algorithms use has not previously been proposed. Our main contribution is bounding the error between any POMDP and its corresponding finite sample particle belief MDP (PB-MDP) approximation. This fundamental bridge between PB-MDPs and POMDPs allows us to adapt any sampling-based MDP algorithm to a POMDP by solving the corresponding particle belief MDP, thereby extending the convergence guarantees of the MDP algorithm to the POMDP. Practically, this is implemented by using the particle filter belief transition model as the generative model for the MDP solver. While this requires access to the observation density model from the POMDP, it only increases the transition sampling complexity of the MDP solver by a factor of $\mathcal{O}(C)$, where $C$ is the number of particles. Thus, when combined with sparse sampling MDP algorithms, this approach can yield algorithms for POMDPs that have no direct theoretical dependence on the size of the state and observation spaces. In addition to our theoretical contribution, we perform five numerical experiments on benchmark POMDPs to demonstrate that a simple MDP algorithm adapted using PB-MDP approximation, Sparse-PFT, achieves performance competitive with other leading continuous observation POMDP solvers.  ( 3 min )
    Variational Inference for SDEs Driven by Fractional Noise. (arXiv:2310.12975v1 [cs.LG])
    We present a novel variational framework for performing inference in (neural) stochastic differential equations (SDEs) driven by Markov-approximate fractional Brownian motion (fBM). SDEs offer a versatile tool for modeling real-world continuous-time dynamic systems with inherent noise and randomness. Combining SDEs with the powerful inference capabilities of variational methods enables the learning of representative function distributions through stochastic gradient descent. However, conventional SDEs typically assume the underlying noise to follow a Brownian motion (BM), which hinders their ability to capture long-term dependencies. In contrast, fractional Brownian motion (fBM) extends BM to encompass non-Markovian dynamics, but existing methods for inferring fBM parameters are either computationally demanding or statistically inefficient. In this paper, building upon the Markov approximation of fBM, we derive the evidence lower bound essential for efficient variational inference of posterior path measures, drawing from the well-established field of stochastic analysis. Additionally, we provide a closed-form expression to determine optimal approximation coefficients. Furthermore, we propose the use of neural networks to learn the drift, diffusion and control terms within our variational posterior, leading to the variational training of neural-SDEs. In this framework, we also optimize the Hurst index, governing the nature of our fractional noise. Beyond validation on synthetic data, we contribute a novel architecture for variational latent video prediction, an approach that, to the best of our knowledge, enables the first variational neural-SDE application to video perception.  ( 3 min )
    The Kernel Density Integral Transformation. (arXiv:2309.10194v2 [stat.ML] UPDATED)
    Feature preprocessing continues to play a critical role when applying machine learning and statistical methods to tabular data. In this paper, we propose the use of the kernel density integral transformation as a feature preprocessing step. Our approach subsumes the two leading feature preprocessing methods as limiting cases: linear min-max scaling and quantile transformation. We demonstrate that, without hyperparameter tuning, the kernel density integral transformation can be used as a simple drop-in replacement for either method, offering protection from the weaknesses of each. Alternatively, with tuning of a single continuous hyperparameter, we frequently outperform both of these methods. Finally, we show that the kernel density transformation can be profitably applied to statistical data analysis, particularly in correlation analysis and univariate clustering.  ( 2 min )
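    The two limiting cases mentioned in the abstract can be illustrated with a minimal sketch. It assumes a Gaussian kernel, an absolute `bandwidth` parameter, and a rescaling that sends the training minimum to 0 and the maximum to 1; the function name `kdi_transform` and these choices are illustrative assumptions and may differ from the paper's exact definition.

```python
import math


def kdi_transform(train, query, bandwidth):
    """Map each query value to the integral of a Gaussian KDE fitted on `train`,
    rescaled so that min(train) -> 0 and max(train) -> 1.

    Assumes `train` contains at least two distinct values and bandwidth > 0.
    """
    scale = bandwidth * math.sqrt(2.0)

    def kde_cdf(x):
        # The CDF of a Gaussian KDE is the average of the per-sample Gaussian CDFs.
        return sum(0.5 * (1.0 + math.erf((x - t) / scale)) for t in train) / len(train)

    lo, hi = kde_cdf(min(train)), kde_cdf(max(train))
    return [(kde_cdf(x) - lo) / (hi - lo) for x in query]


data = [0.0, 1.0, 2.0, 10.0]
# Very large bandwidth: the KDE CDF is nearly linear over the data range,
# so the result is close to min-max scaling.
wide = kdi_transform(data, data, bandwidth=1e3)
# Very small bandwidth: the KDE CDF approaches the empirical CDF,
# so the result is close to a rank-based (quantile) transformation.
narrow = kdi_transform(data, data, bandwidth=1e-3)
print(wide)
print(narrow)
```

    In this sketch, intermediate bandwidths interpolate between the two behaviours, which is the role the abstract assigns to the single continuous hyperparameter.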
    PAC Prediction Sets Under Label Shift. (arXiv:2310.12964v1 [stat.ML])
    Prediction sets capture uncertainty by predicting sets of labels rather than individual labels, enabling downstream decisions to conservatively account for all plausible outcomes. Conformal inference algorithms construct prediction sets guaranteed to contain the true label with high probability. These guarantees fail to hold in the face of distribution shift, which is precisely when reliable uncertainty quantification can be most useful. We propose a novel algorithm for constructing prediction sets with PAC guarantees in the label shift setting. This method estimates the predicted probabilities of the classes in a target domain, as well as the confusion matrix, then propagates uncertainty in these estimates through a Gaussian elimination algorithm to compute confidence intervals for importance weights. Finally, it uses these intervals to construct prediction sets. We evaluate our approach on five datasets: the CIFAR-10, ChestX-Ray and Entity-13 image datasets, the tabular CDC Heart dataset, and the AGNews text dataset. Our algorithm satisfies the PAC guarantee while producing smaller, more informative, prediction sets compared to several baselines.  ( 2 min )
    The Adaptive $\tau$-Lasso: Robustness and Oracle Properties. (arXiv:2304.09310v2 [stat.ML] UPDATED)
    This paper introduces a new regularized version of the robust $\tau$-regression estimator for analyzing high-dimensional datasets subject to gross contamination in the response variables and covariates (explanatory variables). The resulting estimator, termed adaptive $\tau$-Lasso, is robust to outliers and high-leverage points. It also incorporates an adaptive $\ell_1$-norm penalty term, which enables the selection of relevant variables and reduces the bias associated with large true regression coefficients. More specifically, this adaptive $\ell_1$-norm penalty term assigns a weight to each regression coefficient. For a fixed number of predictors $p$, we show that the adaptive $\tau$-Lasso has the oracle property, ensuring both variable-selection consistency and asymptotic normality. Asymptotic normality applies only to the entries of the regression vector corresponding to the true support, assuming knowledge of the true regression vector support. We characterize its robustness via the finite-sample breakdown point and the influence function. We carry out extensive simulations and observe that the class of $\tau$-Lasso estimators exhibits robustness and reliable performance in both contaminated and uncontaminated data settings. We also validate our theoretical findings on robustness properties through simulation experiments. In the face of outliers and high-leverage points, the adaptive $\tau$-Lasso and $\tau$-Lasso estimators achieve the best performance or close-to-best performance in terms of prediction and variable selection accuracy compared to other competing regularized estimators for all scenarios considered in this study. Therefore, the adaptive $\tau$-Lasso and $\tau$-Lasso estimators can be effectively employed for a variety of sparse linear regression problems, particularly in high-dimensional settings and when the data is contaminated by outliers and high-leverage points.  ( 3 min )
    URL: A Representation Learning Benchmark for Transferable Uncertainty Estimates. (arXiv:2307.03810v2 [cs.LG] UPDATED)
    Representation learning has significantly driven the field to develop pretrained models that can act as a valuable starting point when transferring to new datasets. With the rising demand for reliable machine learning and uncertainty quantification, there is a need for pretrained models that not only provide embeddings but also transferable uncertainty estimates. To guide the development of such models, we propose the Uncertainty-aware Representation Learning (URL) benchmark. Besides the transferability of the representations, it also measures the zero-shot transferability of the uncertainty estimate using a novel metric. We apply URL to evaluate eleven uncertainty quantifiers that are pretrained on ImageNet and transferred to eight downstream datasets. We find that approaches that focus on the uncertainty of the representation itself or estimate the prediction risk directly outperform those that are based on the probabilities of upstream classes. Yet, achieving transferable uncertainty quantification remains an open challenge. Our findings indicate that it is not necessarily in conflict with traditional representation learning goals. Code is provided under https://github.com/mkirchhof/url .  ( 2 min )
    Sequential Gibbs Posteriors with Applications to Principal Component Analysis. (arXiv:2310.12882v1 [stat.ME])
    Gibbs posteriors are proportional to a prior distribution multiplied by an exponentiated loss function, with a key tuning parameter weighting information in the loss relative to the prior and providing a control of posterior uncertainty. Gibbs posteriors provide a principled framework for likelihood-free Bayesian inference, but in many situations, including a single tuning parameter inevitably leads to poor uncertainty quantification. In particular, regardless of the value of the parameter, credible regions have far from the nominal frequentist coverage even in large samples. We propose a sequential extension to Gibbs posteriors to address this problem. We prove the proposed sequential posterior exhibits concentration and a Bernstein-von Mises theorem, which holds under easy to verify conditions in Euclidean space and on manifolds. As a byproduct, we obtain the first Bernstein-von Mises theorem for traditional likelihood-based Bayesian posteriors on manifolds. All methods are illustrated with an application to principal component analysis.  ( 2 min )
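    For orientation, the single-parameter Gibbs posterior described in the first sentence is commonly written as follows. The symbols ($\pi$ for the prior, $\ell_n$ for the loss accumulated over $n$ observations, $\eta$ for the tuning parameter) are notational assumptions and not necessarily those used in the paper.

```latex
\pi_{\eta}(\theta \mid x_{1:n}) \;\propto\; \pi(\theta)\, \exp\{-\eta\, \ell_n(\theta)\},
\qquad \ell_n(\theta) = \sum_{i=1}^{n} \ell(\theta; x_i), \quad \eta > 0.
```

    When $\ell$ is a negative log-likelihood and $\eta = 1$, this reduces to the usual Bayesian posterior; larger $\eta$ weights the loss more heavily relative to the prior and concentrates the posterior.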
    A path-norm toolkit for modern networks: consequences, promises and challenges. (arXiv:2310.01225v2 [stat.ML] UPDATED)
    This work introduces the first toolkit around path-norms that is fully able to encompass general DAG ReLU networks with biases, skip connections and any operation based on the extraction of order statistics: max pooling, GroupSort etc. This toolkit notably allows us to establish generalization bounds for modern neural networks that are not only the most widely applicable path-norm based ones, but also recover or beat the sharpest known bounds of this type. These extended path-norms further enjoy the usual benefits of path-norms: ease of computation, invariance under the symmetries of the network, and improved sharpness on feedforward networks compared to the product of operators' norms, another complexity measure most commonly used. The versatility of the toolkit and its ease of implementation allow us to challenge the concrete promises of path-norm-based generalization bounds, by numerically evaluating the sharpest known bounds for ResNets on ImageNet.  ( 2 min )
    Evaluating Superhuman Models with Consistency Checks. (arXiv:2306.09983v3 [cs.LG] UPDATED)
    If machine learning models were to achieve superhuman abilities at various reasoning or decision-making tasks, how would we go about evaluating such models, given that humans would necessarily be poor proxies for ground truth? In this paper, we propose a framework for evaluating superhuman models via consistency checks. Our premise is that while the correctness of superhuman decisions may be impossible to evaluate, we can still surface mistakes if the model's decisions fail to satisfy certain logical, human-interpretable rules. We instantiate our framework on three tasks where correctness of decisions is hard to evaluate due to either superhuman model abilities, or to otherwise missing ground truth: evaluating chess positions, forecasting future events, and making legal judgments. We show that regardless of a model's (possibly superhuman) performance on these tasks, we can discover logical inconsistencies in decision making. For example: a chess engine assigning opposing valuations to semantically identical boards; GPT-4 forecasting that sports records will evolve non-monotonically over time; or an AI judge assigning bail to a defendant only after we add a felony to their criminal record.  ( 2 min )
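    One of the consistency rules named above, monotonicity of forecast sports records, can be sketched as a simple check. The function name, the input format, and the numbers below are hypothetical illustrations and are not taken from the paper.

```python
def monotonicity_violations(forecasts, direction="nonincreasing"):
    """Return pairs of consecutive years whose forecasts break monotonicity.

    forecasts: list of (year, predicted_record) pairs.
    direction: "nonincreasing" for records that can only go down over time
               (e.g. a best sprint time), "nondecreasing" for records that
               can only go up (e.g. a longest jump).
    """
    ordered = sorted(forecasts)
    violations = []
    for (y0, v0), (y1, v1) in zip(ordered, ordered[1:]):
        broken = v1 > v0 if direction == "nonincreasing" else v1 < v0
        if broken:
            violations.append((y0, y1))
    return violations


# Hypothetical model forecasts of a best sprint time in seconds.
sprint_forecasts = [(2025, 9.58), (2030, 9.55), (2035, 9.57)]
print(monotonicity_violations(sprint_forecasts))
```

    The check needs no ground truth: a best time that gets worse between two forecast years is inconsistent regardless of which individual forecast is wrong.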
    Model-agnostic variable importance for predictive uncertainty: an entropy-based approach. (arXiv:2310.12842v1 [stat.ML])
    In order to trust the predictions of a machine learning algorithm, it is necessary to understand the factors that contribute to those predictions. In the case of probabilistic and uncertainty-aware models, it is necessary to understand not only the reasons for the predictions themselves, but also the model's level of confidence in those predictions. In this paper, we show how existing methods in explainability can be extended to uncertainty-aware models and how such extensions can be used to understand the sources of uncertainty in a model's predictive distribution. In particular, by adapting permutation feature importance, partial dependence plots, and individual conditional expectation plots, we demonstrate that novel insights into model behaviour may be obtained and that these methods can be used to measure the impact of features on both the entropy of the predictive distribution and the log-likelihood of the ground truth labels under that distribution. With experiments using both synthetic and real-world data, we demonstrate the utility of these approaches in understanding both the sources of uncertainty and their impact on model performance.  ( 2 min )
    Log-density gradient covariance and automatic metric tensors for Riemann manifold Monte Carlo methods. (arXiv:2211.01746v2 [stat.CO] UPDATED)
    A metric tensor for Riemann manifold Monte Carlo particularly suited for non-linear Bayesian hierarchical models is proposed. The metric tensor is built from symmetric positive semidefinite log-density gradient covariance (LGC) matrices, which are also proposed and further explored here. The LGCs generalize the Fisher information matrix by measuring the joint information content and dependence structure of both a random variable and the parameters of said variable. Consequently, positive definite Fisher/LGC-based metric tensors may be constructed not only from the observation likelihoods as is current practice, but also from arbitrarily complicated non-linear prior/latent variable structures, provided the LGC may be derived for each conditional distribution used to construct said structures. The proposed methodology is highly automatic and allows for exploitation of any sparsity associated with the model in question. When implemented in conjunction with a Riemann manifold variant of the recently proposed numerical generalized randomized Hamiltonian Monte Carlo processes, the proposed methodology is highly competitive, in particular for the more challenging target distributions associated with Bayesian hierarchical models.  ( 2 min )
    Physics-informed neural networks in the recreation of hydrodynamic simulations from dark matter. (arXiv:2303.14090v2 [astro-ph.CO] UPDATED)
    Physics-informed neural networks have emerged as a coherent framework for building predictive models that combine statistical patterns with domain knowledge. The underlying notion is to enrich the optimization loss function with known relationships to constrain the space of possible solutions. Hydrodynamic simulations are a core constituent of modern cosmology, while the required computations are both expensive and time-consuming. At the same time, the comparatively fast simulation of dark matter requires fewer resources, which has led to the emergence of machine learning algorithms for baryon inpainting as an active area of research; here, recreating the scatter found in hydrodynamic simulations is an ongoing challenge. This paper presents the first application of physics-informed neural networks to baryon inpainting by combining advances in neural network architectures with physical constraints, injecting theory on baryon conversion efficiency into the model loss function. We also introduce a punitive prediction comparison based on the Kullback-Leibler divergence, which enforces scatter reproduction. By simultaneously extracting the complete set of baryonic properties for the Simba suite of cosmological simulations, our results demonstrate improved accuracy of baryonic predictions based on dark matter halo properties, successful recovery of the fundamental metallicity relation, and retrieval of scatter that traces the target simulation's distribution.  ( 3 min )
    EDGI: Equivariant Diffusion for Planning with Embodied Agents. (arXiv:2303.12410v2 [cs.LG] UPDATED)
    Embodied agents operate in a structured world, often solving tasks with spatial, temporal, and permutation symmetries. Most algorithms for planning and model-based reinforcement learning (MBRL) do not take this rich geometric structure into account, leading to sample inefficiency and poor generalization. We introduce the Equivariant Diffuser for Generating Interactions (EDGI), an algorithm for MBRL and planning that is equivariant with respect to the product of the spatial symmetry group $\mathrm{SE}(3)$, the discrete-time translation group $\mathbb{Z}$, and the object permutation group $S_n$. EDGI follows the Diffuser framework (Janner et al., 2022) in treating both learning a world model and planning in it as a conditional generative modeling problem, training a diffusion model on an offline trajectory dataset. We introduce a new $\mathrm{SE}(3) \times \mathbb{Z} \times S_n$-equivariant diffusion model that supports multiple representations. We integrate this model in a planning loop, where conditioning and classifier guidance let us softly break the symmetry for specific tasks as needed. On object manipulation and navigation tasks, EDGI is substantially more sample efficient and generalizes better across the symmetry group than non-equivariant models.  ( 2 min )
    Generative Flow Networks as Entropy-Regularized RL. (arXiv:2310.12934v1 [cs.LG])
    The recently proposed generative flow networks (GFlowNets) are a method of training a policy to sample compositional discrete objects with probabilities proportional to a given reward via a sequence of actions. GFlowNets exploit the sequential nature of the problem, drawing parallels with reinforcement learning (RL). Our work extends the connection between RL and GFlowNets to a general case. We demonstrate how the task of learning a generative flow network can be efficiently redefined as an entropy-regularized RL problem with a specific reward and regularizer structure. Furthermore, we illustrate the practical efficiency of this reformulation by applying standard soft RL algorithms to GFlowNet training across several probabilistic modeling tasks. Contrary to previously reported results, we show that entropic RL approaches can be competitive against established GFlowNet training methods. This perspective opens a direct path for integrating reinforcement learning principles into the realm of generative flow networks.  ( 2 min )
    Piecewise Deterministic Markov Processes for Bayesian Neural Networks. (arXiv:2302.08724v2 [stat.ML] UPDATED)
    Inference on modern Bayesian Neural Networks (BNNs) often relies on a variational inference treatment, imposing violated assumptions of independence and the form of the posterior. Traditional MCMC approaches avoid these assumptions at the cost of increased computation due to their incompatibility with subsampling of the likelihood. New Piecewise Deterministic Markov Process (PDMP) samplers permit subsampling, though they introduce model-specific inhomogeneous Poisson Processes (IPPs) which are difficult to sample from. This work introduces a new generic and adaptive thinning scheme for sampling from these IPPs, and demonstrates how this approach can accelerate the application of PDMPs for inference in BNNs. Experimentation illustrates how inference with these methods is computationally feasible and can improve predictive accuracy and MCMC mixing performance while providing informative uncertainty measurements when compared against other approximate inference schemes.  ( 2 min )
    Deep Discriminative to Kernel Density Networks for Calibrated Inference. (arXiv:2201.13001v6 [cs.LG] UPDATED)
    Deep discriminative approaches like random forests and deep neural networks have recently found applications in many important real-world scenarios. However, deploying these learning algorithms in safety-critical applications raises concerns, particularly when it comes to ensuring confidence calibration for both in-distribution and out-of-distribution data points. Many popular methods for in-distribution (ID) calibration, such as isotonic regression and Platt's sigmoidal regression, exhibit excellent ID calibration performance but often at the cost of classification accuracy. Moreover, these methods are not calibrated for the entire feature space, leading to overconfidence in the case of out-of-distribution (OOD) samples. In this paper, we leverage the fact that deep models, including both random forests and deep-nets, learn internal representations which are unions of polytopes with affine activation functions to conceptualize them both as partitioning rules of the feature space. We replace the affine function in each polytope populated by the training data with a Gaussian kernel. We propose sufficient conditions for our proposed methods to be consistent estimators of the corresponding class conditional densities. Moreover, our experiments on both tabular and vision benchmarks show that the proposed approaches obtain well-calibrated posteriors while mostly preserving or improving the classification accuracy of the original algorithm in the in-distribution region, and extrapolate beyond the training data to handle out-of-distribution inputs appropriately.  ( 3 min )
    Neurosymbolic Grounding for Compositional World Models. (arXiv:2310.12690v1 [cs.LG])
    We introduce Cosmos, a framework for object-centric world modeling that is designed for compositional generalization (CG), i.e., high performance on unseen input scenes obtained through the composition of known visual "atoms." The central insight behind Cosmos is the use of a novel form of neurosymbolic grounding. Specifically, the framework introduces two new tools: (i) neurosymbolic scene encodings, which represent each entity in a scene using a real vector computed using a neural encoder, as well as a vector of composable symbols describing attributes of the entity, and (ii) a neurosymbolic attention mechanism that binds these entities to learned rules of interaction. Cosmos is end-to-end differentiable; also, unlike traditional neurosymbolic methods that require representations to be manually mapped to symbols, it computes an entity's symbolic attributes using vision-language foundation models. Through an evaluation that considers two different forms of CG on an established blocks-pushing domain, we show that the framework establishes a new state-of-the-art for CG in world modeling.  ( 2 min )
    DCSI -- An improved measure of cluster separability based on separation and connectedness. (arXiv:2310.12806v1 [stat.ML])
    Whether class labels in a given data set correspond to meaningful clusters is crucial for the evaluation of clustering algorithms using real-world data sets. This property can be quantified by separability measures. A review of the existing literature shows that neither classification-based complexity measures nor cluster validity indices (CVIs) adequately incorporate the central aspects of separability for density-based clustering: between-class separation and within-class connectedness. A newly developed measure (density cluster separability index, DCSI) aims to quantify these two characteristics and can also be used as a CVI. Extensive experiments on synthetic data indicate that DCSI correlates strongly with the performance of DBSCAN measured via the adjusted rand index (ARI) but lacks robustness when it comes to multi-class data sets with overlapping classes that are ill-suited for density-based hard clustering. Detailed evaluation on frequently used real-world data sets shows that DCSI can correctly identify touching or overlapping classes that do not form meaningful clusters.  ( 2 min )
    Compression of Recurrent Neural Networks using Matrix Factorization. (arXiv:2310.12688v1 [cs.LG])
    Compressing neural networks is a key step when deploying models for real-time or embedded applications. Factorizing the model's matrices using low-rank approximations is a promising method for achieving compression. While it is possible to set the rank before training, this approach is neither flexible nor optimal. In this work, we propose a post-training rank-selection method called Rank-Tuning that selects a different rank for each matrix. Used in combination with training adaptations, our method achieves high compression rates with no or little performance degradation. Our numerical experiments on signal processing tasks show that we can compress recurrent neural networks up to 14x with at most 1.4% relative performance reduction.  ( 2 min )
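The parameter arithmetic behind low-rank compression can be sketched as follows. This is only the generic counting argument for replacing an m x n matrix by two factors of rank r; it is not the Rank-Tuning procedure from the abstract, and the function names and the 512 x 512 example are assumptions made for illustration.

```python
def dense_params(rows, cols):
    """Number of entries in a dense rows x cols weight matrix."""
    return rows * cols


def factorized_params(rows, cols, rank):
    """Entries in the two factors (rows x rank) and (rank x cols)."""
    return rank * (rows + cols)


def compression_ratio(rows, cols, rank):
    """How many times smaller the factorized form is than the dense one."""
    return dense_params(rows, cols) / factorized_params(rows, cols, rank)


# Illustrative sizes only: a 512 x 512 recurrent weight matrix at rank 16.
print(compression_ratio(512, 512, 16))  # 16.0
```

Factorizing only saves parameters when the rank is below rows * cols / (rows + cols), which is why choosing the rank per matrix matters.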
    Conditional Density Estimations from Privacy-Protected Data. (arXiv:2310.12781v1 [stat.ML])
    Many modern statistical analysis and machine learning applications require training models on sensitive user data. Differential privacy provides a formal guarantee that individual-level information about users does not leak. In this framework, randomized algorithms inject calibrated noise into the confidential data, resulting in privacy-protected datasets or queries. However, restricting access to only the privatized data during statistical analysis makes it computationally challenging to perform valid inferences on parameters underlying the confidential data. In this work, we propose simulation-based inference methods from privacy-protected datasets. Specifically, we use neural conditional density estimators as a flexible family of distributions to approximate the posterior distribution of model parameters given the observed private query results. We illustrate our methods on discrete time-series data under an infectious disease model and on ordinary linear regression models. Illustrating the privacy-utility trade-off, our experiments and analysis demonstrate the necessity and feasibility of designing valid statistical inference procedures to correct for biases introduced by the privacy-protection mechanisms.  ( 2 min )
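As a minimal sketch of what "injecting calibrated noise" can look like, the standard Laplace mechanism adds noise with scale sensitivity / epsilon to a query result. The abstract does not say which mechanism the authors use, so this is an assumed textbook example; `private_count` and its parameters are hypothetical names.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    # The difference of two i.i.d. exponentials with mean `scale`
    # follows a Laplace distribution with that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(true_count, epsilon, rng, sensitivity=1.0):
    """Release a count with epsilon-differential privacy (Laplace mechanism)."""
    return true_count + laplace_noise(sensitivity / epsilon, rng)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
releases = [private_count(100, epsilon=1.0, rng=rng) for _ in range(20000)]
mean_release = sum(releases) / len(releases)
```

Each single release is noisy, while the average over many releases stays close to the true count. An analyst who sees only the noisy releases has to account for that noise explicitly, which is the inference problem the abstract addresses.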
    Causal Similarity-Based Hierarchical Bayesian Models. (arXiv:2310.12595v1 [cs.LG])
    The key challenge underlying machine learning is generalisation to new data. This work studies generalisation for datasets consisting of related tasks that may differ in causal mechanisms. For example, observational medical data for complex diseases suffers from heterogeneity in causal mechanisms of disease across patients, creating challenges for machine learning algorithms that need to generalise to new patients outside of the training dataset. Common approaches for learning supervised models with heterogeneous datasets include learning a global model for the entire dataset, learning local models for each task's data, or utilising hierarchical, meta-learning and multi-task learning approaches to learn how to generalise from data pooled across multiple tasks. In this paper we propose causal similarity-based hierarchical Bayesian models to improve generalisation to new tasks by learning how to pool data from training tasks with similar causal mechanisms. We apply this general modelling principle to Bayesian neural networks and compare a variety of methods for estimating causal task similarity (for both known and unknown causal models). We demonstrate the benefits of our approach and applicability to real world problems through a range of experiments on simulated and real data.  ( 2 min )
    STANLEY: Stochastic Gradient Anisotropic Langevin Dynamics for Learning Energy-Based Models. (arXiv:2310.12667v1 [stat.ML])
    We propose in this paper, STANLEY, a STochastic gradient ANisotropic LangEvin dYnamics, for sampling high dimensional data. With the growing efficacy and potential of Energy-Based modeling, also known as non-normalized probabilistic modeling, for modeling a generative process of different natures of high dimensional data observations, we present an end-to-end learning algorithm for Energy-Based models (EBM) with the purpose of improving the quality of the resulting sampled data points. While the unknown normalizing constant of EBMs makes the training procedure intractable, resorting to Markov Chain Monte Carlo (MCMC) is in general a viable option. Realizing what MCMC entails for the EBM training, we propose in this paper, a novel high dimensional sampling method, based on an anisotropic stepsize and a gradient-informed covariance matrix, embedded into a discretized Langevin diffusion. We motivate the necessity for an anisotropic update of the negative samples in the Markov Chain by the nonlinearity of the backbone of the EBM, here a Convolutional Neural Network. Our resulting method, namely STANLEY, is an optimization algorithm for training Energy-Based models via our newly introduced MCMC method. We provide a theoretical understanding of our sampling scheme by proving that the sampler leads to a geometrically uniformly ergodic Markov Chain. Several image generation experiments are provided in our paper to show the effectiveness of our method.  ( 2 min )
    Generating collective counterfactual explanations in score-based classification via mathematical optimization. (arXiv:2310.12822v1 [stat.ML])
    Due to the increasing use of Machine Learning models in high stakes decision making settings, it has become increasingly important to have tools to understand how models arrive at decisions. Assuming a trained Supervised Classification model, explanations can be obtained via counterfactual analysis: a counterfactual explanation of an instance indicates how this instance should be minimally modified so that the perturbed instance is classified in the desired class by the Machine Learning classification model. Most of the Counterfactual Analysis literature focuses on the single-instance single-counterfactual setting, in which the analysis is done for one single instance to provide one single explanation. Taking a stakeholder's perspective, in this paper we introduce the so-called collective counterfactual explanations. By means of novel Mathematical Optimization models, we provide a counterfactual explanation for each instance in a group of interest, so that the total cost of the perturbations is minimized under some linking constraints. Making the process of constructing counterfactuals collective instead of individual enables us to detect the features that are critical to the entire dataset to have the individuals classified in the desired class. Our methodology allows for some instances to be treated individually, performing the collective counterfactual analysis for a fraction of records of the group of interest. This way, outliers are identified and handled appropriately. Under some assumptions on the classifier and the space in which counterfactuals are sought, finding collective counterfactuals is reduced to solving a convex quadratic linearly constrained mixed integer optimization problem, which, for datasets of moderate size, can be solved to optimality using existing solvers. The performance of our approach is illustrated on real-world datasets, demonstrating its usefulness.  ( 3 min )
    On the Optimization and Generalization of Multi-head Attention. (arXiv:2310.12680v1 [cs.LG])
    The training and generalization dynamics of the Transformer's core mechanism, namely the Attention mechanism, remain under-explored. Besides, existing analyses primarily focus on single-head attention. Inspired by the demonstrated benefits of overparameterization when training fully-connected networks, we investigate the potential optimization and generalization advantages of using multiple attention heads. Towards this goal, we derive convergence and generalization guarantees for gradient-descent training of a single-layer multi-head self-attention model, under a suitable realizability condition on the data. We then establish primitive conditions on the initialization that ensure realizability holds. Finally, we demonstrate that these conditions are satisfied for a simple tokenized-mixture model. We expect the analysis can be extended to various data-model and architecture variations.  ( 2 min )
    How a student becomes a teacher: learning and forgetting through Spectral methods. (arXiv:2310.12612v1 [cs.LG])
    In theoretical ML, the teacher-student paradigm is often employed as an effective metaphor for real-life tuition. The above scheme proves particularly relevant when the student network is overparameterized as compared to the teacher network. Under these operating conditions, it is tempting to speculate that the student ability to handle the given task could be eventually stored in a sub-portion of the whole network. This latter should be to some extent reminiscent of the frozen teacher structure, according to suitable metrics, while being approximately invariant across different architectures of the student candidate network. Unfortunately, state-of-the-art conventional learning techniques could not help in identifying the existence of such an invariant subnetwork, due to the inherent degree of non-convexity that characterizes the examined problem. In this work, we take a leap forward by proposing a radically different optimization scheme which builds on a spectral representation of the linear transfer of information between layers. The gradient is hence calculated with respect to both eigenvalues and eigenvectors with negligible increase in terms of computational and complexity load, as compared to standard training algorithms. Working in this framework, we could isolate a stable student substructure, that mirrors the true complexity of the teacher in terms of computing neurons, path distribution and topological attributes. When pruning unimportant nodes of the trained student, as follows a ranking that reflects the optimized eigenvalues, no degradation in the recorded performance is seen above a threshold that corresponds to the effective teacher size. The observed behavior can be pictured as a genuine second-order phase transition that bears universality traits.  ( 3 min )
    Constrained Reweighting of Distributions: an Optimal Transport Approach. (arXiv:2310.12447v1 [stat.ML])
    We commonly encounter the problem of identifying an optimally weight-adjusted version of the empirical distribution of observed data, adhering to predefined constraints on the weights. Such constraints often manifest as restrictions on the moments, tail behaviour, shapes, number of modes, etc., of the resulting weight-adjusted empirical distribution. In this article, we substantially enhance the flexibility of such methodology by introducing nonparametrically imbued distributional constraints on the weights, and developing a general framework leveraging the maximum entropy principle and tools from optimal transport. The key idea is to ensure that the maximum entropy weight-adjusted empirical distribution of the observed data is close to a pre-specified probability distribution in terms of the optimal transport metric while allowing for subtle departures. The versatility of the framework is demonstrated in the context of three disparate applications where data re-weighting is warranted to satisfy side constraints on the optimization problem at the heart of the statistical task: namely, portfolio allocation, semi-parametric inference for complex surveys, and ensuring algorithmic fairness in machine learning algorithms.  ( 2 min )
    Canonical normalizing flows for manifold learning. (arXiv:2310.12743v1 [stat.ML])
    Manifold learning flows are a class of generative modelling techniques that assume a low-dimensional manifold description of the data. The embedding of such a manifold into the high-dimensional space of the data is achieved via learnable invertible transformations. Therefore, once the manifold is properly aligned via a reconstruction loss, the probability density is tractable on the manifold and maximum likelihood can be used to optimize the network parameters. Naturally, the lower-dimensional representation of the data requires an injective-mapping. Recent approaches were able to enforce that density aligns with the modelled manifold, while efficiently calculating the density volume-change term when embedding to the higher-dimensional space. However, unless the injective-mapping is analytically predefined, the learned manifold is not necessarily an efficient representation of the data. Namely, the latent dimensions of such models frequently learn an entangled intrinsic basis with degenerate information being stored in each dimension. Alternatively, if a locally orthogonal and/or sparse basis is to be learned, here coined canonical intrinsic basis, it can serve in learning a more compact latent space representation. Towards this end, we propose a canonical manifold learning flow method, where a novel optimization objective enforces the transformation matrix to have few prominent and orthogonal basis functions. Canonical manifold flow yields a more efficient use of the latent space, automatically generating fewer prominent and distinct dimensions to represent data, and consequently a better approximation of target distributions than other manifold flow methods in most experiments we conducted, resulting in lower FID scores.  ( 2 min )
    Approximate information maximization for bandit games. (arXiv:2310.12563v1 [stat.ML])
    Entropy maximization and free energy minimization are general physical principles for modeling the dynamics of various physical systems. Notable examples include modeling decision-making within the brain using the free-energy principle, optimizing the accuracy-complexity trade-off when accessing hidden variables with the information bottleneck principle (Tishby et al., 2000), and navigation in random environments using information maximization (Vergassola et al., 2007). Built on this principle, we propose a new class of bandit algorithms that maximize an approximation to the information of a key variable within the system. To this end, we develop an approximated analytical physics-based representation of an entropy to forecast the information gain of each action and greedily choose the one with the largest information gain. This method yields strong performances in classical bandit settings. Motivated by its empirical success, we prove its asymptotic optimality for the two-armed bandit problem with Gaussian rewards. Owing to its ability to encompass the system's properties in a global physical functional, this approach can be efficiently adapted to more complex bandit settings, calling for further investigation of information maximization approaches for multi-armed bandit problems.  ( 2 min )
    Optimal Excess Risk Bounds for Empirical Risk Minimization on $p$-norm Linear Regression. (arXiv:2310.12437v1 [math.ST])
    We study the performance of empirical risk minimization on the $p$-norm linear regression problem for $p \in (1, \infty)$. We show that, in the realizable case, under no moment assumptions, and up to a distribution-dependent constant, $O(d)$ samples are enough to exactly recover the target. Otherwise, for $p \in [2, \infty)$, and under weak moment assumptions on the target and the covariates, we prove a high probability excess risk bound on the empirical risk minimizer whose leading term matches, up to a constant that depends only on $p$, the asymptotically exact rate. We extend this result to the case $p \in (1, 2)$ under mild assumptions that guarantee the existence of the Hessian of the risk at its minimizer.  ( 2 min )
    Explanation-Based Training with Differentiable Insertion/Deletion Metric-Aware Regularizers. (arXiv:2310.12553v1 [cs.LG])
    The quality of explanations for the predictions of complex machine learning predictors is often measured using insertion and deletion metrics, which assess the faithfulness of the explanations, i.e., how correctly the explanations reflect the predictor's behavior. To improve the faithfulness, we propose insertion/deletion metric-aware explanation-based optimization (ID-ExpO), which optimizes differentiable predictors to improve both insertion and deletion scores of the explanations while keeping their predictive accuracy. Since the original insertion and deletion metrics are non-differentiable with respect to the explanations and not directly usable for gradient-based optimization, we extend the metrics to be differentiable and use them to formalize insertion and deletion metric-based regularizers. The experimental results on image and tabular datasets show that the deep neural networks-based predictors fine-tuned using ID-ExpO enable popular post-hoc explainers to produce more faithful and easy-to-interpret explanations while keeping high predictive accuracy.  ( 2 min )
    Unmasking Transformers: A Theoretical Approach to Data Recovery via Attention Weights. (arXiv:2310.12462v1 [cs.LG])
    In the realm of deep learning, transformers have emerged as a dominant architecture, particularly in natural language processing tasks. However, with their widespread adoption, concerns regarding the security and privacy of the data processed by these models have arisen. In this paper, we address a pivotal question: Can the data fed into transformers be recovered using their attention weights and outputs? We introduce a theoretical framework to tackle this problem. Specifically, we present an algorithm that aims to recover the input data $X \in \mathbb{R}^{d \times n}$ from given attention weights $W = QK^\top \in \mathbb{R}^{d \times d}$ and output $B \in \mathbb{R}^{n \times n}$ by minimizing the loss function $L(X)$. This loss function captures the discrepancy between the expected output and the actual output of the transformer. Our findings have significant implications for the Localized Layer-wise Mechanism (LLM), suggesting potential vulnerabilities in the model's design from a security and privacy perspective. This work underscores the importance of understanding and safeguarding the internal workings of transformers to ensure the confidentiality of processed data.  ( 2 min )
    Neural Likelihood Approximation for Integer Valued Time Series Data. (arXiv:2310.12544v1 [stat.ML])
    Stochastic processes defined on integer valued state spaces are popular within the physical and biological sciences. These models are necessary for capturing the dynamics of small systems where the individual nature of the populations cannot be ignored and stochastic effects are important. The inference of the parameters of such models, from time series data, is difficult due to intractability of the likelihood; current methods, based on simulations of the underlying model, can be so computationally expensive as to be prohibitive. In this paper we construct a neural likelihood approximation for integer valued time series data using causal convolutions, which allows us to evaluate the likelihood of the whole time series in parallel. We demonstrate our method by performing inference on a number of ecological and epidemiological models, showing that we can accurately approximate the true posterior while achieving significant computational speed ups in situations where current methods struggle.  ( 2 min )
    Closed-Form Diffusion Models. (arXiv:2310.12395v1 [cs.LG])
    Score-based generative models (SGMs) sample from a target distribution by iteratively transforming noise using the score function of the perturbed target. For any finite training set, this score function can be evaluated in closed form, but the resulting SGM memorizes its training data and does not generate novel samples. In practice, one approximates the score by training a neural network via score-matching. The error in this approximation promotes generalization, but neural SGMs are costly to train and sample, and the effective regularization this error provides is not well-understood theoretically. In this work, we instead explicitly smooth the closed-form score to obtain an SGM that generates novel samples without training. We analyze our model and propose an efficient nearest-neighbor-based estimator of its score function. Using this estimator, our method achieves sampling times competitive with neural SGMs while running on consumer-grade CPUs.  ( 2 min )
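The closed-form score mentioned above can be sketched for the simplest case. Assuming the perturbed target is the training set convolved with an isotropic Gaussian of standard deviation sigma (a variance-exploding style perturbation, which is an assumption here), the score at x is a softmax-weighted average of (x_i - x) / sigma^2 over training points x_i. This is the unsmoothed, memorizing score, not the smoothed estimator the paper proposes; `closed_form_score` is a hypothetical name.

```python
import math


def closed_form_score(x, data, sigma):
    """Score of (1/N) * sum_i Normal(x; x_i, sigma^2 I), evaluated at x."""
    # Log of each Gaussian component's unnormalized density at x.
    logits = [
        -sum((a - b) ** 2 for a, b in zip(x, point)) / (2.0 * sigma ** 2)
        for point in data
    ]
    # Subtract the max before exponentiating for numerical stability.
    top = max(logits)
    weights = [math.exp(value - top) for value in logits]
    total = sum(weights)
    # Weighted average of the directions towards each training point.
    return [
        sum(w * (point[d] - x[d]) for w, point in zip(weights, data))
        / (total * sigma ** 2)
        for d in range(len(x))
    ]


# With a single training point the score points straight at it.
print(closed_form_score([0.0], [[2.0]], sigma=1.0))  # [2.0]
```

As sigma shrinks, the softmax weights concentrate on the nearest training point, which is one way to see why sampling with this exact score reproduces the training data.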
    Towards Enhanced Local Explainability of Random Forests: a Proximity-Based Approach. (arXiv:2310.12428v1 [stat.ML])
    We initiate a novel approach to explain the out of sample performance of random forest (RF) models by exploiting the fact that any RF can be formulated as an adaptive weighted K nearest-neighbors model. Specifically, we use the proximity between points in the feature space learned by the RF to re-write random forest predictions exactly as a weighted average of the target labels of training data points. This linearity facilitates a local notion of explainability of RF predictions that generates attributions for any model prediction across observations in the training set, and thereby complements established methods like SHAP, which instead generates attributions for a model prediction across dimensions of the feature space. We demonstrate this approach in the context of a bond pricing model trained on US corporate bond trades, and compare our approach to various existing approaches to model explainability.  ( 2 min )
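The rewrite of a forest prediction as a weighted average of training labels can be sketched with hand-written leaf assignments. The sketch assumes every tree was grown on the full training set (no bootstrap resampling) and predicts the mean label of the leaf a point falls into; the leaf indices and labels below are made up for illustration, and a real forest would supply them.

```python
def proximity_weights(train_leaves, query_leaves):
    """Weight of each training point in the forest's prediction for one query.

    train_leaves[t][i] is the leaf of training point i in tree t;
    query_leaves[t] is the leaf the query lands in for tree t.
    """
    n_trees = len(train_leaves)
    weights = [0.0] * len(train_leaves[0])
    for leaves, query_leaf in zip(train_leaves, query_leaves):
        members = [i for i, leaf in enumerate(leaves) if leaf == query_leaf]
        for i in members:
            # Each tree spreads weight 1/n_trees evenly over the shared leaf.
            weights[i] += 1.0 / (n_trees * len(members))
    return weights


def weighted_prediction(weights, labels):
    """Forest prediction expressed as a weighted sum of training labels."""
    return sum(w * y for w, y in zip(weights, labels))


labels = [1.0, 2.0, 3.0, 4.0]               # illustrative training targets
train_leaves = [[0, 0, 1, 1], [0, 1, 1, 1]]  # two illustrative trees
query_leaves = [0, 1]
weights = proximity_weights(train_leaves, query_leaves)
prediction = weighted_prediction(weights, labels)
```

Here the first tree predicts the mean of the first two labels (1.5) and the second tree the mean of the last three (3.0), so the forest average is 2.25, which matches the weighted sum. The per-point weights are the attributions across training observations that the abstract refers to.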
    Sparse high-dimensional linear mixed modeling with a partitioned empirical Bayes ECM algorithm. (arXiv:2310.12285v1 [stat.ME])
    High-dimensional longitudinal data is increasingly used in a wide range of scientific studies. However, there are few statistical methods for high-dimensional linear mixed models (LMMs), as most Bayesian variable selection or penalization methods are designed for independent observations. Additionally, the few available software packages for high-dimensional LMMs suffer from scalability issues. This work presents an efficient and accurate Bayesian framework for high-dimensional LMMs. We use empirical Bayes estimators of hyperparameters for increased flexibility and an Expectation-Conditional-Minimization (ECM) algorithm for computationally efficient maximum a posteriori probability (MAP) estimation of parameters. The novelty of the approach lies in its partitioning and parameter expansion as well as its fast and scalable computation. We illustrate Linear Mixed Modeling with PaRtitiOned empirical Bayes ECM (LMM-PROBE) in simulation studies evaluating fixed and random effects estimation along with computation time. A real-world example is provided using data from a study of lupus in children, where we identify genes and clinical factors associated with a new lupus biomarker and predict the biomarker over time.  ( 2 min )
    Preference Optimization for Molecular Language Models. (arXiv:2310.12304v1 [stat.ML])
    Molecular language modeling is an effective approach to generating novel chemical structures. However, these models do not \emph{a priori} encode certain preferences a chemist may desire. We investigate the use of fine-tuning using Direct Preference Optimization to better align generated molecules with chemist preferences. Our findings suggest that this approach is simple, efficient, and highly effective.  ( 2 min )

  • Open

    [D] Is LangChain the right solution?
    Hello, I would love to have an LLM that can provide answers (in chat format) based on some of the SQL DB data we have. I want it for an internal company project. I am by no means an expert, but I am decent at programming and want to build a system to get answers in chat format. My understanding is that training LLMs from the ground up is prohibitively expensive and that LangChain is a sort of hybrid, efficient solution. Please suggest any other solutions. Also, would LangChain being a company and not open source pose a problem in terms of copyright? Thanks! submitted by /u/betelgeuseian [link] [comments]  ( 9 min )
    [R] MemGPT: Towards LLMs as Operating Systems - UC Berkeley 2023 - Is able to create unbounded/infinite LLM context!
    Paper: https://arxiv.org/abs/2310.08560 Github: https://github.com/cpacker/MemGPT Blog: https://memgpt.ai/ Youtube: https://youtu.be/QQ2QOPWZKVc?si=_bSSXU9EQE0FP64h MemGPT 🧠 Giving AI Unlimited Prompt Size (Big Step Towards AGI?) by Matthew Berman / Must watch and he also explains how to install it! Overview: LLMs are increasingly being used for perpetual chats. Limited context lengths make perpetual chat challenging. MemGPT manages a virtual context (inspired by virtual memory in operating systems) to create unbounded LLM context. With MemGPT, we demonstrate that LLMs can be taught to manage their own memory! Abstract: Large language models (LLMs) have revolutionized AI, but are constrained by limited context windows, hindering their utility in tasks like extended conversa…  ( 9 min )
    [D] Some beginner questions about Whisper for transcription
    Hi, I am a Mac user. I am trying to use whisper.cpp downloaded from its GitHub repository. I don't know much about Python or coding, so I basically followed this guide to install and use it. I downloaded the large model to try it. I am using it for non-English languages and I want to use it for language learning purposes, so I can understand what is being said in an Instagram story, a YouTube video (without subtitles), a TV series, an extract of a movie, etc. I was using MacWhisper, but I wanted to try the pro features, including the pro models for non-English languages, without paying for them (for now). My question is: all of the files that I want to transcribe are video files with the .mp4 extension. Can I also transcribe those with Whisper? If not, and if I can only transcribe audio files, can they be .mp3? I understand that I need to install and use ffmpeg. Does it support mp3? Also, as I understand it, the transcribed text will appear in the terminal. Can I export it as .srt or PDF? Thanks submitted by /u/toughytough [link] [comments]  ( 9 min )
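One way to approach the workflow asked about here is to extract a 16 kHz mono WAV from the video with ffmpeg and then pass that file to whisper.cpp with SRT output enabled. The sketch below only builds the two command lines. The binary name `./main`, the model path, and the file names are assumptions that depend on the local whisper.cpp build, and whether a given build also accepts mp3 or mp4 input directly is not something this sketch checks.

```python
from pathlib import Path


def ffmpeg_extract_command(video_path):
    """Command that converts a video or audio file to 16 kHz mono 16-bit WAV."""
    wav_path = Path(video_path).with_suffix(".wav")
    return [
        "ffmpeg", "-i", str(video_path),
        "-ar", "16000",        # resample to 16 kHz
        "-ac", "1",            # downmix to mono
        "-c:a", "pcm_s16le",   # 16-bit PCM audio
        str(wav_path),
    ]


def whisper_command(wav_path, model_path, language, binary="./main"):
    """Command that transcribes a WAV file and also writes an .srt file.

    `binary` and `model_path` are assumptions about the local build.
    """
    return [
        binary, "-m", str(model_path), "-f", str(wav_path),
        "-l", language,   # spoken language code, for example "es"
        "-osrt",          # write subtitles as an SRT file
    ]


extract = ffmpeg_extract_command("clip.mp4")
transcribe = whisper_command("clip.wav", "models/ggml-large.bin", "es")
```

Each list can be passed to `subprocess.run(cmd, check=True)` once ffmpeg and whisper.cpp are installed. ffmpeg reads both mp4 and mp3 input, so the same extraction step covers either kind of file.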
    [D] Transformers are basically CNNs?
    I've watched an interesting video: Deriving the Ultimate Neural Network Architecture from Scratch. It's about how to arrive at the transformer architecture when you have an understanding of CNNs. The crux of it is the idea of pairwise convolutional layers. The first layer applies not to the sequence of words itself, but to all pairs of words in the sentence. This ensures that each relation between words that are far from each other is taken into account. The next convolutional layer applies to all pairs of results of the previous one. This way longer subsequences of words are factored in. My question is: are there any articles on how transformers were invented? I see a lot of explanations of the original paper, but at best they all answer the question of how transformers work. But why is the architecture the way it is? Was it discovered like the video describes? Or was the path more convoluted? I'd like to know more about this connection. Anyway, it would be great to figure out in full detail how these pairwise layers are related to the concepts of query, key, and value. Here's what the author of the video wrote in the comments: Yeah it's a term I made up so you won't find it in any sources, sorry about that. Usually sources will just talk about self attention in terms of key, query and value lookups, so you can look at those to get a more detailed understanding of the transformer. The value transform is equivalent to the linear representation function I use in the pairwise convolution, the key and query attention scores are equivalent to the bi-linear form scoring function I use (with the bi-linear form weight matrix given by Q^TK). I chose to use this unusual terminology because, personally, I feel the key, query and value terminology comes out of nowhere, and I wanted to connect the transformer more directly to its predecessor (the CNN). submitted by /u/Veson [link] [comments]  ( 10 min )
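The equivalence the video author describes between query/key scores and a bilinear form can be checked numerically. The sketch below compares the dot product of a projected query and a projected key with the bilinear form x^T (Q^T K) y for the same pair of tokens. The small integer matrices and vectors are arbitrary illustrative values, and the helper names are assumptions.

```python
def matvec(matrix, vector):
    """Matrix-vector product for nested lists."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def transpose(matrix):
    return [list(column) for column in zip(*matrix)]


def matmul(left, right):
    """Matrix-matrix product for nested lists."""
    right_t = transpose(right)
    return [[sum(a * b for a, b in zip(row, col)) for col in right_t] for row in left]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


Q = [[1, 2], [0, 1]]   # illustrative query projection
K = [[2, 0], [1, 3]]   # illustrative key projection
x = [1, 2]             # embedding of the attending token
y = [3, 1]             # embedding of the attended token

# Attention-style score: project both tokens, then take a dot product.
score_qk = dot(matvec(Q, x), matvec(K, y))

# Bilinear-form score: a single weight matrix W = Q^T K between raw embeddings.
W = matmul(transpose(Q), K)
score_bilinear = dot(x, matvec(W, y))

print(score_qk, score_bilinear)  # 42 42
```

The two numbers agree because (Qx)^T (Ky) = x^T Q^T K y, so separate query and key matrices are one parameterization of a pairwise scoring matrix, typically a low-rank one when the projection dimension is smaller than the embedding dimension.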
    [R] Does this learning curve show any serious under/overfitting problems?
    I'm trying to fit a multivariate LSTM model to time series data to predict future values for one relatively noisy series. I noticed that the loss (MSE in this case) is pretty high given that the data has been standardized beforehand. So I really have two questions: why is the MSE so high, and is the learning curve indicative of any obvious problems? Thank you! https://preview.redd.it/r9bel6p7kfvb1.png?width=547&format=png&auto=webp&s=4eee53aa8005da8a89f330f6e98fe6cadde3467e submitted by /u/DifferenceUnhappy393 [link] [comments]  ( 9 min )
    [Discussion] Is the deadly triad real?
    Sutton and Barto’s textbook mentions that combining off-policy learning, bootstrapping, and function approximation leads to extreme instability and should be avoided. Yet when I encounter a reinforcement learning problem in the wild and look at how people go about solving it, if someone’s solution involves bootstrapping, more often than not it’s some variation of deep Q-learning. Why is this? submitted by /u/BiasedEstimators [link] [comments]
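To make the three ingredients concrete, here is a minimal sketch of one semi-gradient Q-learning update with linear function approximation. The target bootstraps from the current estimate of the next state, the max over actions makes it off-policy, and the weight vectors are the function approximator. The feature vectors, step size, and discount below are illustrative assumptions, and the sketch does not demonstrate divergence.

```python
def q_value(weights, features, action):
    """Linear action value: dot product of the action's weights and features."""
    return sum(w * f for w, f in zip(weights[action], features))


def q_learning_step(weights, features, action, reward, next_features, alpha, gamma):
    """Apply one semi-gradient Q-learning update in place; return the TD error."""
    # Bootstrapping + off-policy: target uses the greedy next-state estimate.
    best_next = max(q_value(weights, next_features, a) for a in range(len(weights)))
    td_error = reward + gamma * best_next - q_value(weights, features, action)
    # Function approximation: move the weights along the feature vector.
    weights[action] = [
        w + alpha * td_error * f for w, f in zip(weights[action], features)
    ]
    return td_error


weights = [[0.0, 0.0], [0.0, 0.0]]  # two actions, two features
error = q_learning_step(
    weights, features=[1.0, 0.0], action=0, reward=1.0,
    next_features=[0.0, 1.0], alpha=0.5, gamma=0.9,
)
```

Deep Q-learning uses the same update with a neural network in place of the linear weights, usually alongside stabilizers such as replay buffers and target networks.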
    [P] building a D&D NPC
    Hey everyone, I'm learning ML but I'm barely scratching the surface of the terminology. Two years ago I couldn't code anything, but through school (Python, SQL, and R) I learned the fundamentals. I also have access to Codecademy. My current program is very machine learning/deep learning focused. On the side I DM a D&D game. Within the context of the world (Eberron), robots are common. With my ADHD and being a new DM, I want to outsource lore questions players might have (ones I would otherwise have to look up, slowing down the game). The concept is to have a GUI where the player interacts with the chatbot. I've gotten to a proof-of-concept workflow on Google Colab. Thanks to LangChain I managed to ingest PDFs and a URL, put them in a directory, embed the text, load it into a vector DB, have the LLM pull from the vector DB, and answer the question. Now I don't know what to do. I tried to bring the Colab notebook onto Google Cloud, but now cloud is becoming a rabbit hole with Vertex and DocAI... and I don't want to deep dive into that if it's outside the scope of this "project". I'd appreciate any advice, links, etc. I had limited success in Botpress using a single PDF. It works but feels unsatisfying. n8n looks promising, but if it's not intuitive then I don't want to go down that road. If I posted in the wrong group please direct me to the correct one. submitted by /u/work929 [link] [comments]  ( 9 min )
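The workflow described in this post (embed text chunks, retrieve the closest ones, hand them to the LLM) can be sketched without any framework. The example below swaps in a bag-of-words cosine similarity for a learned embedding model and stops at building the prompt, so no LLM or vector database is involved. The lore sentences are invented placeholders, not real setting material, and all function names are assumptions.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Very rough stand-in for an embedding: lowercase token counts."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine(a, b):
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(question, chunks, top_k=1):
    """Return the top_k chunks most similar to the question."""
    query = bag_of_words(question)
    ranked = sorted(chunks, key=lambda c: cosine(query, bag_of_words(c)), reverse=True)
    return ranked[:top_k]


def build_prompt(question, chunks, top_k=1):
    """Assemble the text that would be sent to the language model."""
    context = "\n".join(retrieve(question, chunks, top_k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


# Placeholder lore chunks for illustration only.
lore = [
    "The harbor gate closes at sunset.",
    "The blacksmith sells silver daggers.",
    "The library is run by a retired wizard.",
]
best = retrieve("When does the harbor gate close?", lore)
prompt = build_prompt("When does the harbor gate close?", lore)
```

Keeping retrieval this small makes it easy to run locally behind a simple chat interface; the similarity function can later be replaced by real embeddings without changing the rest of the flow.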
    [R] In-Context Pretraining: Language Modeling Beyond Document Boundaries
    https://arxiv.org/abs/2310.10638 "Large language models (LMs) are currently trained to predict tokens given document prefixes, enabling them to directly perform long-form generation and prompting-style tasks which can be reduced to document completion. Existing pretraining pipelines train LMs by concatenating random sets of short documents to create input contexts but the prior documents provide no signal for predicting the next document. We instead present In-Context Pretraining, a new approach where language models are pretrained on a sequence of related documents, thereby explicitly encouraging them to read and reason across document boundaries. We can do In-Context Pretraining by simply changing the document ordering so that each context contains related documents, and directly applying existing pretraining pipelines. However, this document sorting problem is challenging. There are billions of documents and we would like the sort to maximize contextual similarity for every document without repeating any data. To do this, we introduce approximate algorithms for finding related documents with efficient nearest neighbor search and constructing coherent input contexts with a graph traversal algorithm. Our experiments show In-Context Pretraining offers a simple and scalable approach to significantly enhance LMs' performance: we see notable improvements in tasks that require more complex contextual reasoning, including in-context learning (+8%), reading comprehension (+15%), faithfulness to previous contexts (+16%), long-context reasoning (+5%), and retrieval augmentation (+9%)." submitted by /u/Parking-Priority6217 [link] [comments]  ( 9 min )
    [R] Using Machine Learning to set parameters in sensors (College Project)
    Greetings, I'm in my 2nd year of college (Artificial Intelligence bachelor's degree), and I'm currently working on a group project that will require machine learning. The project consists of managing and regulating the conditions (temperature, humidity, lighting, etc.) of the environment surrounding important products (vaccines, human organs, etc.) during transportation, using sensors built into their transport box. To make that possible, our group is planning to use a predictive machine learning model to prevent cases such as exposure to inappropriate temperature levels that could damage the product, and to take appropriate measures to improve the environment before it reaches such dangerous conditions. I would therefore like to know which tools and skills will be needed and helpful to achieve this goal. If you have any advice, it'll be very much appreciated. :) submitted by /u/Storm2003 [link] [comments]  ( 9 min )
    [R] 3D-GPT: A new method for procedural Text-to-3D model generation
    Researchers propose a new AI system called 3D-GPT that creates 3D models by combining natural language instructions and agents specialized for working with existing 3D modeling tools. 3D-GPT has predefined functions that make 3D shapes, and it tweaks parameters to build scenes. The key is getting the AI to understand instructions and pick the right tools. It has three main agents: (1) a dispatcher that parses the text and picks generation functions; (2) a conceptualizer that adds details missing from the description; (3) a modeler that sets parameters and outputs code to drive 3D software. By breaking modeling work down into steps, the agents can collab to match the descriptions. This is sort of like how a 3D modeling team of humans would work. The paper authors show it making simple scenes like "lush meadow with flowers" that fit the text. It also modifies scenes appropriately when given new instructions. I include some gifs of example outputs in my full summary. They look pretty good - I would say 2005-quality graphics. There are limits. It fully relies on existing generators, so quality is capped. Details and curves are iffy. It resorts to default shapes often instead of true understanding. And I doubt the verts and textures are well-optimized. The agent architecture seems to be really popular right now. This one shows some planning skills, which could extend to more creative tasks someday. TLDR: AI agents can team up to generate 3D models from text instructions. Works to some degree but limitations remain. Full summary. Paper here. submitted by /u/Successful-Western27 [link] [comments]  ( 9 min )
    [R] Bayesian Optimization-based Combinatorial Assignment
    Link: https://ojs.aaai.org/index.php/AAAI/article/view/25726/25498 Abstract: We study the combinatorial assignment domain, which includes combinatorial auctions and course allocation. The main challenge in this domain is that the bundle space grows exponentially in the number of items. To address this, several papers have recently proposed machine learning-based preference elicitation algorithms that aim to elicit only the most important information from agents. However, the main shortcoming of this prior work is that it does not model a mechanism's uncertainty over values for not yet elicited bundles. In this paper, we address this shortcoming by presenting a Bayesian optimization-based combinatorial assignment (BOCA) mechanism. Our key technical contribution is to integrate a method for capturing model uncertainty into an iterative combinatorial auction mechanism. Concretely, we design a new method for estimating an upper uncertainty bound that can be used to define an acquisition function to determine the next query to the agents. This enables the mechanism to properly explore (and not just exploit) the bundle space during its preference elicitation phase. We run computational experiments in several spectrum auction domains to evaluate BOCA's performance. Our results show that BOCA achieves higher allocative efficiency than state-of-the-art approaches. https://preview.redd.it/aeo36u3wldvb1.png?width=1288&format=png&auto=webp&s=2982547f8af51ed7195f49dbec9359fecba1693f ​ submitted by /u/Yossarian_1234 [link] [comments]  ( 9 min )
    [D] What is the latest method for models with multimodal outputs? How can the shared embedding used by a lot of multimodal models be dynamically "routed" to the proper modality during output?
    So a lot of multimodal models I've seen use a linear layer to transform encoded image/video/audio into the multimodal LLMs embedding space. This makes sense for the input, but how would output work? Normally you use a layer to convert the embedding to a SoftMax of probabilities of possible output tokens. This makes sense for discrete outputs like tokens but not for continuous outputs like images or audio. ​ submitted by /u/30299578815310 [link] [comments]  ( 9 min )
    [D] Is anyone else tired of “whatever OpenAI does is the best!” narrative?
    The title says it all. I agree what they did is incredible and literally changed AI landscape in last couple of years. But I’m getting tired of everyone acting like OpenAI is the only one doing great research. The twit-fluencers praising even the slightest peep from them. I don’t understand this fanaticism in AI community. There are smart researchers doing smart things all over the world. But they don’t even get a fraction of appreciation they deserve. And the strangest thing of all, ChatGPT is used as oracle to evaluate models in research papers. Consistency models are extremely meh and if it did not come out of openAI, people would’ve forgotten them a long time ago! Edit 1: I’m in grad school and that’s all a lot of students around me talk about/ chase. I want to work on a bit more fundamental problems, but I feel like I’m being left behind. Edit 2: This post is mostly a rant about academics obsessed with OpenAI research/products and LLMs. submitted by /u/mildlyphd [link] [comments]  ( 9 min )
    [P] Hacktoberfest Machine Learning Projects for JS/TS Developers 🎃
    Hey everyone, we have published an article about Hacktoberfest projects 🎃 (medium.com) with a curated list of open-source machine learning GUI projects built with JavaScript or TypeScript. https://preview.redd.it/nr4jfbqoscvb1.png?width=1352&format=png&auto=webp&s=fbb2313aabf0a617b6e426f1fa5018946b7ed7f5 🔍 Finding machine learning projects that are suitable for JS/TS developers during Hacktoberfest can be daunting due to the overwhelming abundance of open-source projects. We’ve simplified this process, offering you a refined selection of opportunities where your coding skills can shine and make a real impact. The selection includes: Spotlight, our powerful tool for intuitively exploring unstructured datasets directly from dataframes; Iterative's CML (Continuous Machine Learning), a command-line interface tool designed to enhance continuous integration and delivery (CI/CD) workflows; Inclusive Code Reviews, a browser extension for improving online comments such as code reviews on GitHub or Azure DevOps; and BeatBridge, a music player with a recommendation engine. Each project offers a unique blend of challenges and learning opportunities, inviting you to contribute and grow your skills and knowledge in the dynamic world of open source. Choose a project that resonates with you, select an issue, and make an impact 🚀. submitted by /u/DocBrownMS [link] [comments]  ( 9 min )
    [R] AgentTuning: Enabling Generalized Agent Abilities for LLMs - Tsinghua University 2023 - Agent-tuned open model comparable to GPT-3.5-Turbo on unseen agent tasks!
    Paper: https://arxiv.org/abs/2310.12823 Github: https://github.com/THUDM/AgentTuning Model: https://huggingface.co/THUDM/agentlm-70b Abstract: Open large language models (LLMs) with great performance in various tasks have significantly advanced the development of LLMs. However, they are far inferior to commercial models such as ChatGPT and GPT-4 when acting as agents to tackle complex tasks in the real world. These agent tasks employ LLMs as the central controller responsible for planning, memorization, and tool utilization, necessitating both fine-grained prompting methods and robust LLMs to achieve satisfactory performance. Though many prompting methods have been proposed to complete particular agent tasks, there is lack of research focusing on improving the agent capabilities of L…  ( 9 min )
    [D] People working for (relatively) large organisations. How are LLMs accessed by employees within your organisation right now?
    I'm wondering whether LLMs within your organisation are widely used (including non-programmers), and in an (official) capacity that prevents OpenAI/Microsoft or another third party from using the input. Here, I'm talking about access by a wide variety of employees, not including as part of a data pipeline that doesn't have a user interface and only performs one job. Does your organization have a custom-built interface with enterprise access to an LLM? Use one of the open-source interfaces, or does your organisation provide access through i.e. Microsoft copilot? What about access to Github copilot (for programmers)? Or does your organisation have some kind of SAAS solution? If you have some kind of RAG within the organisation that isn't built-in into a product. What sort of stack do you use? Do you use OpenAI plugins to access this? submitted by /u/Background_Claim7907 [link] [comments]  ( 9 min )
    [D] Thoughts on Open-Domain QnA Systems?
    Been really interested in Open-Domain Question Answering these days and saw some interesting new models apart from the typical Retriever-Reader e.g. Generator-Retriever-Generator. Anyone particularly excited about anything new in the field - some new technique/model etc.? submitted by /u/Aggravating-Floor-38 [link] [comments]  ( 9 min )
    [N] State of AI Report 2023
    The State of AI Report for this year is out : https://www.stateof.ai/2023-report-launch A 160-slide presentation/report which seems quite exhaustive in the discussed topics, and provides a good view of the "hottest" research axes this year. Previous reports (yearly since 2019) are available on their website and have been generally well received in this sub. submitted by /u/ElkoSoltius [link] [comments]  ( 9 min )
    [R] Large Language Models as Analogical Reasoners
    https://arxiv.org/abs/2310.01714 "Chain-of-thought (CoT) prompting for language models demonstrates impressive performance across reasoning tasks, but typically needs labeled exemplars of the reasoning process. In this work, we introduce a new prompting approach, Analogical Prompting, designed to automatically guide the reasoning process of large language models. Inspired by analogical reasoning, a cognitive process in which humans draw from relevant past experiences to tackle new problems, our approach prompts language models to self-generate relevant exemplars or knowledge in the context, before proceeding to solve the given problem. This method presents several advantages: it obviates the need for labeling or retrieving exemplars, offering generality and convenience; it can also tailor the generated exemplars and knowledge to each problem, offering adaptability. Experimental results show that our approach outperforms 0-shot CoT and manual few-shot CoT in a variety of reasoning tasks, including math problem solving in GSM8K and MATH, code generation in Codeforces, and other reasoning tasks in BIG-Bench." https://preview.redd.it/f9azq40pwavb1.jpg?width=6390&format=pjpg&auto=webp&s=0af3de7925a6ef8f442e40f952849db2f544c3a7 submitted by /u/Parking-Priority6217 [link] [comments]
    [R] Large Language Models as Optimizers
    https://arxiv.org/abs/2309.03409 "Optimization is ubiquitous. While derivative-based algorithms have been powerful tools for various problems, the absence of gradient imposes challenges on many real-world applications. In this work, we propose Optimization by PROmpting (OPRO), a simple and effective approach to leverage large language models (LLMs) as optimizers, where the optimization task is described in natural language. In each optimization step, the LLM generates new solutions from the prompt that contains previously generated solutions with their values, then the new solutions are evaluated and added to the prompt for the next optimization step. We first showcase OPRO on linear regression and traveling salesman problems, then move on to prompt optimization where the goal is to find instructions that maximize the task accuracy. With a variety of LLMs, we demonstrate that the best prompts optimized by OPRO outperform human-designed prompts by up to 8% on GSM8K, and by up to 50% on Big-Bench Hard tasks." submitted by /u/Parking-Priority6217 [link] [comments]  ( 9 min )
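    The optimization loop described in the abstract (show previous solutions with their values, ask for a new solution, evaluate it, add it to the prompt) can be sketched as below. This is a hedged illustration, not the paper's implementation: `propose` is a hypothetical stand-in for the LLM call (here it just perturbs the best known solution so the example runs offline), and `toy_objective`, `build_meta_prompt` and `opro_loop` are names made up for this sketch.

```python
import random


def build_meta_prompt(history, top_k=5):
    """Format the top-k (solution, score) pairs seen so far into a text prompt."""
    best = sorted(history, key=lambda pair: pair[1])[-top_k:]
    lines = [f"solution: {s:.3f}, score: {v:.3f}" for s, v in best]
    return "Previous solutions and scores:\n" + "\n".join(lines) + "\nPropose a better solution."


def propose(meta_prompt, history, rng):
    """Hypothetical stand-in for the LLM: randomly perturb the best known solution.

    A real system would send `meta_prompt` to a model and parse its reply.
    """
    best_solution, _ = max(history, key=lambda pair: pair[1])
    return best_solution + rng.gauss(0.0, 0.5)


def opro_loop(objective, initial, steps=200, seed=0):
    """Repeat: build prompt from history, get a candidate, score it, record it."""
    rng = random.Random(seed)
    history = [(initial, objective(initial))]
    for _ in range(steps):
        prompt = build_meta_prompt(history)
        candidate = propose(prompt, history, rng)
        history.append((candidate, objective(candidate)))
    return max(history, key=lambda pair: pair[1])


def toy_objective(x):
    """Toy score to maximize; the optimum is at x = 3."""
    return -(x - 3.0) ** 2


best_x, best_score = opro_loop(toy_objective, initial=0.0)
```

    The only part that changes when moving from this toy to prompt optimization is what a "solution" is (an instruction string) and how it is scored (task accuracy); the history-in-prompt loop stays the same.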
    [D] Community's thoughts on r/singularity and other non-technical machine learning subreddits?
    I’ve seen many comments telling people to go to r/singularity, so I’ve been wondering about the community's thoughts on non-technical subreddits. Are they seen as a source of hype, getting newcomers more interested in the field and helping to advance knowledge? Or do you see such communities as an overly optimistic, non-skeptical, massive misinformation/active disinformation center? Do you think there’s something that can be done to improve these communities? What do you think their role should be relative to the technical communities? Do you have any specific criticisms? For those of you who think our two communities should be separate, to what extent? submitted by /u/Username912773 [link] [comments]  ( 9 min )
    [R] Neural Relation Graph: A Unified Framework for Identifying Label Noise and Outlier Data (NeurIPS 2023)
    paper: https://arxiv.org/abs/2301.12321 code: https://github.com/snu-mllab/Neural-Relation-Graph TL;DR: We present a scalable and domain-agnostic approach utilizing the relational structure of data for identifying label noise and outliers. https://preview.redd.it/o9k7kliqe9vb1.png?width=3108&format=png&auto=webp&s=b7c34bd7f4bc130915440986570104f9bebd4f07 Diagnosing and cleaning data is a crucial step for building robust machine learning systems. However, identifying problems within large-scale datasets with real-world distributions is challenging due to the presence of complex issues such as label errors, under-representation, and outliers. In this paper, we propose a unified approach for identifying the problematic data by utilizing a largely ignored source of information: a relational structure of data in the feature-embedded space. To this end, we present scalable and effective algorithms for detecting label errors and outlier data based on the relational graph structure of data. We further introduce a visualization tool that provides contextual information of a data point in the feature-embedded space, serving as an effective tool for interactively diagnosing data. We evaluate the label error and outlier/out-of-distribution (OOD) detection performances of our approach on the large-scale image, speech, and language domain tasks, including ImageNet, ESC-50, and SST2. Our approach achieves state-of-the-art detection performance on all tasks considered and demonstrates its effectiveness in debugging large-scale real-world datasets across various domains. Figure captions: Detected samples with label error (red colored) from ImageNet (top) and SST2 (bottom). Detected outlier samples from ImageNet (top) and SST2 (bottom) validation sets. submitted by /u/janghyun1230 [link] [comments]  ( 9 min )
    [D] Future AI development on accessible hardware?
    Is there a future where models can run efficiently and at scale with just half a dozen high end consumer GPUs? A lot of people seem to think the bottleneck is "there's no competition for NVIDIA" but I actually think the current bottleneck is software. 4x 4090s is more CUDA cores, more transistors, more VRAM than a H100, but the performance and price difference is staggering, which should not be the case. Raspberry Pi 4s running faster desktops than a same generation Dell Inspiron prove that software integration is key. Cheap performance is laying on the table, it just has to be used more effectively by models and ML libraries submitted by /u/HovercraftForeign591 [link] [comments]  ( 9 min )
    [D] Online masters alternatives for MLOps
    Hi everyone, greetings from South America. Basically, I'm looking for a program to learn and improve my job opportunities in the MLOps field and, at some point, get positions with more responsibility. I recently got admitted to both OMSA and OMSCS at Georgia Tech, but I feel those programs are more focused on the data science side of things. Is there any other alternative without a GRE requirement that you would recommend at a similar cost? Maybe I'm wrong about the aforementioned programs; if you think so, please let me know why. Thanks! submitted by /u/imatiasmb [link] [comments]  ( 9 min )
  • Open

    Sell Like Crazy with This One ChatGPT Prompt
    submitted by /u/Senior_tasteey [link] [comments]
    Amazon Tests Humanoid Robots in Warehouses
    submitted by /u/Master-Strawberry-26 [link] [comments]
    Reddit is considering a soft paywall if AI companies don't pay up
    Reddit is considering implementing a soft paywall on its content if generative AI companies do not agree to pay for using its data. This move comes as tensions rise between tech giants and content publishers over the financial stakes in the generative AI market. Reddit believes that its vast range of user-generated text makes it a goldmine for AI training data, but critics argue that much of the content is copied from other sources or links to third-party resources. Enforcing a soft paywall could provide leverage in negotiations with AI companies, but it may also alienate the Reddit community and impede the discovery of new content. Major newspapers like The New York Times and The Washington Post have also blocked AI companies from scraping their websites for training data. Enforcing a soft paywall is a double-edged sword for Reddit, as it could provide leverage in negotiations but also alienate the community and impede content discovery. Reddit's broken search engine is a major concern, and implementing a paywall could result in a significant loss of search traffic. If Reddit and other content giants implement paywalls, it could impact how generative AI models are trained and lead to increased expenses and a slower rate of innovation. This move by Reddit may pave the way for more publishers and platforms to implement paywalls, potentially reshuffling the industry. Source : https://stackdiary.com/reddit-thinks-its-data-is-worth-enforcing-a-log-in-page/ submitted by /u/NuseAI [link] [comments]
    Researchers propose 3D-GPT: combining LLMs and agents for procedural Text-to-3D model generation
    Researchers propose a new AI system called 3D-GPT that creates 3D models by combining natural language instructions and agents specialized for working with existing 3D modeling tools. 3D-GPT has predefined functions that make 3D shapes, and it tweaks parameters to build scenes. The key is getting the AI to understand instructions and pick the right tools. It has three main agents: (1) a dispatcher that parses the text and picks generation functions; (2) a conceptualizer that adds details missing from the description; (3) a modeler that sets parameters and outputs code to drive 3D software. By breaking modeling work down into steps, the agents can collab to match the descriptions. This is sort of like how a 3D modeling team of humans would work. The paper authors show it making simple scenes like "lush meadow with flowers" that fit the text. It also modifies scenes appropriately when given new instructions. I include some gifs of example outputs in my full summary. They look pretty good - I would say 2005-quality graphics. There are limits. It fully relies on existing generators, so quality is capped. Details and curves are iffy. It resorts to default shapes often instead of true understanding. And I doubt the verts and textures are well-optimized. The agent architecture seems to be really popular right now. This one shows some planning skills, which could extend to more creative tasks someday. TLDR: AI agents can team up to generate 3D models from text instructions. Works to some degree but limitations remain. Full summary. Paper here. submitted by /u/Successful-Western27 [link] [comments]
    AI — weekly megathread!
    News provided by aibrews.com ​ Adept open-sources Fuyu-8B - a multimodal model designed from the ground up for digital agents, so it can support arbitrary image resolutions, answer questions about graphs and diagrams, answer UI-based questions and more. It has a much simpler architecture and training procedure than other multi-modal models- there is no image encoder [Details]. Meta AI researchers present an AI system that can be deployed in real time to reconstruct, from brain activity, the images perceived and processed by the brain at each instant. It uses magnetoencephalography (MEG), a non-invasive neuroimaging technique in which thousands of brain activity measurements are taken per second [Details]. Scaled Foundations released GRID (General Robot Intelligence Development) - a p…
    People are grieving the 'death' of their AI companions after a chatbot app abruptly shut down
    submitted by /u/thisisinsider [link] [comments]
    'Mind-blowing' IBM chip speeds up AI
    Researchers at IBM have developed a brain-inspired computer chip called NorthPole that can supercharge artificial intelligence (AI) by working faster with much less power. The chip eliminates the need to frequently access external memory, allowing it to perform tasks such as image recognition faster and consume less power. NorthPole runs neural networks and is made up of 256 computing units, each with its own memory. It beats existing AI machines in benchmark tests and uses one-fifth of the energy of state-of-the-art AI chips. However, it is not suitable for large language models and can only run pre-programmed neural networks. Source : https://www.nature.com/articles/d41586-023-03267-0 submitted by /u/NuseAI [link] [comments]
    Photograph of puddles reflecting the sky on a cobbled street.
    submitted by /u/IllustriousVideo6145 [link] [comments]
    One-Minute Daily AI News 10/20/2023
    In a fascinating development, a software engineer named Peter Whidden has trained an artificial intelligence (AI) algorithm to play the classic Pokémon games. Over the course of several years, the AI has spent over 50,000 hours playing the game and has amassed a large following on YouTube.[1] YouTube is developing a tool powered by artificial intelligence that would let creators record audio using the voices of famous musicians, according to people familiar with the matter.[2] Google taps gen-AI to help users in India search through government welfare schemes.[3] Huawei is rolling out a new HarmonyOS 4.0.0.126 software update for the Huawei Mate 60 Pro, which brings a new AI Cloud Image Enhancement feature and other important enhancements to the system.[4] Sources: [1] https://gameishard.gg/news/can-artificial-intelligence-play-pokemon/400727/ [2] https://www.bloomberg.com/news/articles/2023-10-19/youtube-working-on-tool-that-would-let-creators-sing-like-drake?embedded-checkout=true [3] https://news.yahoo.com/google-taps-gen-ai-help-063850226.html [4] https://www.huaweicentral.com/huawei-mate-60-pro-gets-a-cloud-image-enhancement-feature-google-pixel-8-pro-lags-behind/ submitted by /u/Excellent-Target-847 [link] [comments]
    Live Introduction to Core Machine Learning Concepts Course (Sailea)
    >Sailea is a student-run non-profit that does not charge for any of its services. Join the FIRST lesson of SAILea’s course on the Principles of AI! 🌳 Covers: Unsupervised, Supervised, and Reinforcement Learning; Overfitting, Underfitting, Confusion Matrix; Decision Trees 🗓️ October 21st ⏰ 7:00-8:00PM EST Why Sailea? Only course targeted at high schoolers. Free forever. Join Us Now! 👉 (signup form) https://docs.google.com/forms/d/e/1FAIpQLSfQGCeZClTdF6zeIQ-RtbOGR582bb1slc3oR0zG2J7j1v5RHg/viewform?usp=sf_link 🌳 Register today, get involved in the community and grow your knowledge! submitted by /u/Envoy-Insc [link] [comments]
  • Open

    DQN with a binary vector as output
    Hey everyone! I hope you're doing well. I need your help. I'm working on a DQN that outputs a binary vector of length L (I apply a sigmoid function on the output layer and take p > 0.5 as 1 and 0 otherwise). In this setting, how can I modify the code below to update my DQN?
        def update(self):
            states, actions, rewards, next_states, dones = self.memory.sample(self.batch_size)
            states = torch.FloatTensor(np.array(states))
            actions = torch.LongTensor(np.array(actions))
            rewards = torch.FloatTensor(np.array(rewards))
            next_states = torch.FloatTensor(np.array(next_states))
            dones = torch.FloatTensor(np.array(dones))
            q_values = self.model(states)
            q_values = q_values.gather(1, actions.unsqueeze(1))
            next_q_values = self.target_model(next_states).detach()
            expected_q_values = rewards + self.gamma * (1 - dones) * next_q_values.max(1)[0]
            expected_q_values = expected_q_values.unsqueeze(1)
            loss = nn.BCELoss(q_values, expected_q_values)
            self.optimizer.zero_grad()
            loss.backward()
            self.optimizer.step()
    submitted by /u/GuavaAgreeable208 [link] [comments]
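    One possible direction for the question above, offered as a sketch rather than a definitive answer: treat each of the L bits as its own binary sub-action with two Q-values (Q for bit = 0 and Q for bit = 1), pick each bit greedily, and regress the chosen Q-values onto per-bit TD targets with a squared-error loss. Binary cross-entropy expects targets in [0, 1], which unbounded TD targets generally are not. The example is plain Python so it runs without a framework; the function names and the independent-per-bit target are assumptions of this sketch, not the poster's code.

```python
def greedy_bits(q):
    """q is a list of L pairs [Q(bit=0), Q(bit=1)]; choose the larger entry per bit."""
    return [int(q1 > q0) for q0, q1 in q]


def select_q(q, action_bits):
    """Pick the Q-value of the bit value that was actually taken, for each position."""
    return [pair[bit] for pair, bit in zip(q, action_bits)]


def per_bit_td_targets(reward, done, next_q, gamma=0.99):
    """Per-bit target: r + gamma * max over the two bit values of the target network's Q.

    The shared reward is used for every bit (a simplifying assumption).
    """
    bootstrap = 0.0 if done else gamma
    return [reward + bootstrap * max(q0, q1) for q0, q1 in next_q]


def mse(predictions, targets):
    """Mean squared TD error across the L bits."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(predictions)


# Toy transition with L = 2 bits.
q_current = [[0.2, 0.7], [1.0, -1.0]]   # online network output for state s
q_next = [[0.0, 0.5], [0.3, 0.1]]       # target network output for state s'
action = greedy_bits(q_current)          # the binary vector that would be executed
chosen = select_q(q_current, action)
targets = per_bit_td_targets(reward=1.0, done=False, next_q=q_next, gamma=0.5)
loss = mse(chosen, targets)
```

    In a tensor framework the same idea would correspond to a network output of shape (batch, L, 2), an index lookup along the last dimension using the stored 0/1 action vector, and a regression loss against the per-bit targets.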
    Is the “Deadly Triad” even real?
    Sutton and Barto’s textbook mentions that combining off-policy learning, bootstrapping, and function approximation leads to extreme instability and should be avoided. Yet when I encounter a reinforcement learning problem in the wild and look at how people go about solving it, if someone’s solution involves bootstrapping, more often than not it’s some variation of deep Q-learning. Why is this? submitted by /u/BiasedEstimators [link] [comments]
    Dead simple explanations of popular RL concepts (open source)
    Hey everyone! I just started an open-source repo for RL explanations. https://github.com/DenseLayers/densewiki Many people, especially beginners struggle to develop the intuition around concepts (like actor-critic vs advantage actor-critic, GAE, PPO, etc). Often it's nice to see what's happening at a high level first, before we dive deeper into the math. That's what I'm trying to do here. But I can't do it alone, so I'm posting here to get help from others in the community to make sure the explanations are clear, extremely approachable, and accurate. If you'd like to work with me on this (whether you're a complete beginner or very knowledgeable), please reach out! ​ submitted by /u/mngrwl [link] [comments]
    Reinforcement learning on the game "Quarto"
    Hello, I am working on solving a board game called "Quarto", which has 16 different pieces. The pieces share attributes: each is black or white, short or tall, hollow-topped or closed-topped, and square or round, so each piece has four attributes. The winning condition is to place 4 pieces in a row on the 4x4 board with at least one attribute in common. We also have to choose the piece the opponent plays: the opponent places that piece and then gives us a piece to place. So there are two actions per turn. I have defined the action space as 256 + 16, where 256 = 16*16 because any piece can be placed anywhere on the board, and the last 16 cover the final possible move, i.e. the move that leads to a terminating state, so the next_piece for the opponent would be blank …
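    The 256 + 16 action space described above can be made concrete with a small encoder/decoder. This is a sketch under an assumption the post does not spell out: each of the 256 joint actions pairs a board square (0 to 15) with the piece handed to the opponent (0 to 15), and the extra 16 actions are placements with no piece handed over. Pieces are represented as 4-bit integers, one bit per attribute; all names here are illustrative.

```python
NUM_SQUARES = 16
NUM_PIECES = 16
NUM_JOINT = NUM_SQUARES * NUM_PIECES      # 256 (square, next_piece) actions
NUM_ACTIONS = NUM_JOINT + NUM_SQUARES     # plus 16 placement-only actions = 272


def encode_action(square, next_piece=None):
    """Map (square, next_piece) to a flat index; next_piece=None means no piece is handed over."""
    if next_piece is None:
        return NUM_JOINT + square
    return square * NUM_PIECES + next_piece


def decode_action(index):
    """Inverse of encode_action: return (square, next_piece or None)."""
    if index >= NUM_JOINT:
        return index - NUM_JOINT, None
    return divmod(index, NUM_PIECES)


def line_wins(pieces):
    """True if four 4-bit pieces share at least one attribute value (a common 1 bit or a common 0 bit)."""
    a, b, c, d = pieces
    common_ones = a & b & c & d
    common_zeros = ~a & ~b & ~c & ~d & 0xF
    return common_ones != 0 or common_zeros != 0
```

    A policy or Q-network would then have 272 outputs, with an action mask removing occupied squares and already-used pieces before the argmax or sampling step.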
    What is the optimal way to train a PPO?
    Hello! I've got a really simple question: I'm training a PPO algorithm and I want to know the best way to train my model. Sorry, I'll try to be clear! Right now what I'm doing is: 1) load a previously trained PPO model; 2) train the model for 20000 timesteps; 3) evaluate the reward of the newly trained PPO model at the end of the timesteps and compare it to the reward of the model loaded in step 1; 4) if the reward is greater, go back to step 1 using the new model; 5) if not, go back to step 1. Is this a correct way to do it? Thanks a lot and have a great day! submitted by /u/PointNo1904 [link] [comments]
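    The loop described in the post (load, train for a fixed budget, evaluate, keep the new model only if it scores higher) can be written down as a small framework-independent sketch. `train_fn` and `eval_fn` are hypothetical callables standing in for "train PPO for 20000 timesteps" and "run one evaluation episode"; they are not part of any library. Averaging several evaluation episodes is an addition of this sketch, since a single noisy episode can make a worse model look better.

```python
import random
from statistics import mean


def keep_best_training(initial_model, train_fn, eval_fn, rounds=10, eval_episodes=20):
    """Repeatedly train from the best model so far and keep a candidate only if it evaluates higher."""
    best_model = initial_model
    best_score = mean(eval_fn(best_model) for _ in range(eval_episodes))
    for _ in range(rounds):
        candidate = train_fn(best_model)   # e.g. continue training a copy of the best model
        score = mean(eval_fn(candidate) for _ in range(eval_episodes))
        if score > best_score:
            best_model, best_score = candidate, score
    return best_model, best_score


# Toy stand-ins: a "model" is just a skill number, training nudges it randomly,
# and evaluation returns the skill plus episode noise.
rng = random.Random(0)
toy_train = lambda skill: skill + rng.gauss(0.1, 1.0)
toy_eval = lambda skill: skill + rng.gauss(0.0, 0.5)
final_model, final_score = keep_best_training(0.0, toy_train, toy_eval)
```

    One design caveat: always restarting from the best checkpoint discards candidates that are temporarily worse, which can also discard useful exploration, so saving periodic checkpoints alongside the best one is a common complement.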
    new chess dataset: 3.2b games (608b moves) generated by 2500-ELO Stockfish selfplay {LAION}
    submitted by /u/gwern [link] [comments]
  • Open

    Governing the ML lifecycle at scale, Part 1: A framework for architecting ML workloads using Amazon SageMaker
    Customers of every size and industry are innovating on AWS by infusing machine learning (ML) into their products and services. Recent developments in generative AI models have further sped up the need of ML adoption across industries. However, implementing security, data privacy, and governance controls are still key challenges faced by customers when implementing ML […]  ( 16 min )
    How Meesho built a generalized feed ranker using Amazon SageMaker inference
    This is a guest post co-written by Rama Badrinath, Divay Jindal and Utkarsh Agrawal at Meesho. Meesho is India’s fastest growing ecommerce company with a mission to democratize internet commerce for everyone and make it accessible to the next billion users of India. Meesho was founded in 2015 and today focuses on buyers and sellers […]  ( 6 min )
  • Open

    Answering billions of reporting queries each day with low latency
    Posted by Jagan Sankaranarayanan, Senior Staff Software Engineer, and Indrajit Roy, Head of Napa Product, Google Google Ads infrastructure runs on an internal data warehouse called Napa. Billions of reporting queries, which power critical dashboards used by advertising clients to measure campaign performance, run on tables stored in Napa. These tables contain records of ads performance that are keyed using particular customers and the campaign identifiers with which they are associated. Keys are tokens that are used both to associate an ads record with a particular client and campaign (e.g., customer_id, campaign_id) and for efficient retrieval. A record contains dozens of keys, so clients use reporting queries to specify keys needed to filter the data to understand ads performance (e.…  ( 93 min )
  • Open

    For the World to See: Nonprofit Deploys GPU-Powered Simulators to Train Providers in Sight-Saving Surgery
    GPU-powered surgical-simulation devices are helping train more than 2,000 doctors a year in lower-income countries to treat cataract blindness, the world’s leading cause of blindness, thanks to the nonprofit HelpMeSee. While cataract surgery has a success rate of around 99%, many patients in low- and middle-income countries lack access to the common procedure due to Read article >  ( 6 min )
    Eureka! NVIDIA Research Breakthrough Puts New Spin on Robot Learning
    A new AI agent developed by NVIDIA Research that can teach robots complex skills has trained a robotic hand to perform rapid pen-spinning tricks — for the first time as well as a human can. The stunning prestidigitation, showcased in the video above, is one of nearly 30 tasks that robots have learned to expertly Read article >  ( 6 min )
  • Open

    Article: Computer Vision in Agriculture. Challenges & Solutions.
    ​ https://preview.redd.it/j3nmj31llcvb1.jpg?width=2500&format=pjpg&auto=webp&s=c09804179e4f40a854e1327fa9150f1ab0c0dfd0 Interesting article about use cases of data augmentation in agricultural industry. Short description: In this article, you will cover: • How computer vision solutions are transforming the agricultural industry. • Observe the importance of quality data for developing AI solutions that perform crop and livestock analysis and monitoring with high and steady accuracy. • Explore the use of synthetic data to facilitate data collection in various conditions. • Take a look at examples of tasks in agriculture. How can we solve them with computer vision, and how can we apply synthetic data to extend the augmentation? More details are here submitted by /u/No-Independence5880 [link] [comments]
  • Open

    The 19th rule of HIPAA Safe Harbor
    The HIPAA Safe Harbor provision says that data can be considered deidentified if 18 kinds of data are removed or reported at low resolution. At the end of the list of 18 items, there is an extra category, sometimes informally called the 19th rule: The covered entity does not have actual knowledge that the information […] The 19th rule of HIPAA Safe Harbor first appeared on John D. Cook.  ( 5 min )

  • Open

    How Many Businesses Use AI?
    submitted by /u/Senior_tasteey [link] [comments]
    Is the Roko Basilisk Thought Experiment Forbidden To Talk About?
    I was reading this article on Roko's basilisk and it reminded me of the long debates I had about it 10 years ago. The idea of a sentient AI keeping a grudge against those who didn't help in its creation, and condemning them is fascinating. And I don't quite understand why LessWrong stopped Basilisk. What if we are already in the Basilisk's simulation? What if LessWrong never pulled the plug? submitted by /u/fookingyeah [link] [comments]
    Conversing with Vulnerabilities: AI-Assisted CVE Search
    submitted by /u/Zimmax [link] [comments]
    YouTube wants to launch an AI-powered tool that lets you sound like your favorite singer, report says
    submitted by /u/thisisinsider [link] [comments]
    College Student looking for advice
    I'm a sophomore at a small college, and I'm coming up on scheduling for the classes that are about to start actually mattering, and I need some advice. I'm highly interested in both robotics and AI, but I'm not sure what to major in (likely double major). I know CS is a common tie between the two fields, but I'm not sure what additional major to include. I can choose either data science or physics. I could also technically include ME but I'm much less inclined to do so. Any advice is appreciated! submitted by /u/Inferno980 [link] [comments]
    Thoughts on a global compute cap for potential AGI projects?
    There's been a bunch of discourse in the run up to the November AI Safety Summit in the UK about what safety policies should be in place. ARC Evals & Anthropic are pushing for 'Responsible Scaling', which doesn't put any hard upper limits on the amount of compute that powerful models can use. There are others who think we need a global compute cap. Thoughts on enforcing a ceiling for the amount of compute/FLOP that both state & non-state actors can use? submitted by /u/Seamus127 [link] [comments]
    Artificial Revolution | AI Technology and its effects on the Labour Market.
    submitted by /u/senploxart [link] [comments]
    EU Elections at Risk with Rise of AI-Enabled Information Manipulation
    The 11th edition of the Threat Landscape report by the European Union Agency for Cybersecurity (ENISA) highlights the risks posed by AI-enabled information manipulation in the upcoming EU elections. The report recorded approximately 2580 incidents during the reporting period, with 220 incidents specifically targeting two or more EU Member States. The sectors mostly targeted include public administrations (19%) and health (8%), with a cascading effect observed due to interdependencies. Information manipulation campaigns are considered a major threat to election processes, with individuals (47%) and public administration (29%) being the primary targets. The report also provides an overview of evolving trends in threat actors, including state-nexus actors targeting key individuals through spear phishing and social networks. Ransomware and DDoS attacks remain the top threats, accounting for 34% and 28% of all threats, respectively. The motivations behind these threats include financial gain, disruption, espionage, destruction, and ideology. The report highlights the potential misuse of artificial intelligence-powered chatbots in phishing attempts, information manipulation, and cybercrime. Older techniques like search engine optimization (SEO) poisoning and malvertising have also seen a resurgence among cybercrime actors. The report concludes by emphasizing the importance of addressing vulnerabilities and ensuring cybersecure infrastructures for the integrity and availability of information in the EU electoral process. Source : https://www.enisa.europa.eu/news/eu-elections-at-risk-with-rise-of-ai-enabled-information-manipulation submitted by /u/NuseAI [link] [comments]
    Is chatgpt,Bard,Poe,Bing ai chatbot ai or research and Analysis ai?
    Tia submitted by /u/Emad_341 [link] [comments]
    One-Minute Daily AI News 10/19/2023
    NVIDIA has announced that its open-source TensorRT-LLM library, formerly limited to data center usage, is now accessible for Windows personal computers.[1] Microsoft just shipped Azure AI Content Safety to general availability. It’s an AI-powered platform designed to “help organizations create safer online environments.”[2] Mozilla Brings a Fake Review Checker AI Tool to Firefox.[3] Nvidia and iPhone maker Foxconn to build ‘AI factories’.[4] Sources: [1] https://winbuzzer.com/2023/10/18/nvidia-unveils-tensorrt-llm-tool-to-boost-ai-language-model-performance-on-windows-pcs-xcxwbn/ [2] https://www.windowscentral.com/software-apps/microsoft-wants-to-make-ai-safer-and-it-just-unveiled-a-service-to-help [3] https://www.marktechpost.com/2023/10/17/mozilla-brings-a-fake-review-checker-ai-tool-to-firefox/ [4] https://www.bbc.com/news/business-67153669 submitted by /u/Excellent-Target-847 [link] [comments]
    Is chatgpt,Bard,Poe,Bing ai chatbot ai or research and Analysis ai?
    Thank you submitted by /u/Emad_341 [link] [comments]
    Danny Davinci
    submitted by /u/chuck-yeah [link] [comments]
    OpenAI Kills Arrakis
    submitted by /u/Agitated-Spell3979 [link] [comments]
    The insane AI power of DALL-E 3
    submitted by /u/the_anonymizer [link] [comments]
    AI Is Booming. This Is How CEOs Are Using It
    AI is having a significant impact on the direction of products for CEOs, who are committing talent and resources to building AI capabilities. Incumbent platforms like OpenAI and AWS are dominating the AI market. Coding co-pilots like GitHub Co-Pilot are widely adopted. The adoption of AI tools, including coding co-pilots, is not leading to a reduction in engineering headcount for most CEOs. However, some CEOs have reported that co-pilots have reduced their future hiring needs. The landscape of AI tools is expected to continue shifting, with more second order effects and value-add use cases emerging. Source : https://www.flexcapital.com/post/ai-is-booming-this-is-how-ceos-are-actually-using-it submitted by /u/NuseAI [link] [comments]  ( 9 min )
  • Open

    machine learning on a microcontroller [P]
    I am making an EEG machine for a university project. I will be taking in an analogue signal and converting it to digital, then sending the varying voltages to a microcontroller in the hope that it will be able to categorise them into states of mind, or as simply as telling whether or not the person's eyes are open or closed. I have very little knowledge of machine learning, but it is required to be implemented in the project. My lecturer is pressuring me to make a final pick of which software and microcontroller I will be using for this project. Everyone else in the class is using Edge Impulse, which the lecturer said wouldn't be applicable to me as it uses accelerometers and voice, and CY8CKIT-042 PSoC 4 Pioneer Kits, which apparently aren't suited for me either. Any help would be much appreciated, and I do apologise if this is too rambly. submitted by /u/disslixac [link] [comments]  ( 9 min )
    [R] OpenAgents: An Open Platform for Language Agents in the Wild - The University of Hong Kong 2023
    Paper: https://arxiv.org/abs/2310.10634v1 Github: https://github.com/xlang-ai/OpenAgents Abstract: Language agents show potential in being capable of utilizing natural language for varied and intricate tasks in diverse environments, particularly when built upon large language models (LLMs). Current language agent frameworks aim to facilitate the construction of proof-of-concept language agents while neglecting the non-expert user access to agents and paying little attention to application-level designs. We present OpenAgents, an open platform for using and hosting language agents in the wild of everyday life. OpenAgents includes three agents: (1) Data Agent for data analysis with Python/SQL and data tools; (2) Plugins Agent with 200+ daily API tools; (3) Web Agent for autonomous web browsing. OpenAgents enables general users to interact with agent functionalities through a web user interface optimized for swift responses and common failures while offering developers and researchers a seamless deployment experience on local setups, providing a foundation for crafting innovative language agents and facilitating real-world evaluations. We elucidate the challenges and opportunities, aspiring to set a foundation for future research and development of real-world language agents. https://preview.redd.it/syl2gzh3q8vb1.jpg?width=1084&format=pjpg&auto=webp&s=4045d3abb5cdb7587614795e709cdaba03bc122d https://preview.redd.it/aus342i3q8vb1.jpg?width=1086&format=pjpg&auto=webp&s=73de7976db5a8bbed880350fab8ab56be3fee550 https://preview.redd.it/qstz81i3q8vb1.jpg?width=1346&format=pjpg&auto=webp&s=1626482556a90abf418abb5d56f8e5599cb1e3d6 submitted by /u/Singularian2501 [link] [comments]  ( 9 min )
    [Research] Hypernymy-based approach for text-to-image models (Blog post)
    Text-to-image models have rapidly progressed in recent years, but most popular evaluation metrics (such as FID) do not consider their linguistic abilities. A new approach measures how well these models understand subtype relations between concepts. Researchers from Yandex proposed two metrics that combine well-known tools like the WordNet database and ImageNet classifiers in a novel way, allowing them to analyze models like Stable Diffusion in more detail. Blog post. submitted by /u/metkere [link] [comments]  ( 9 min )
    [R] Can Large Language Models Explain Themselves? A Study of LLM-Generated Self-Explanations
    Large language models (LLMs) such as ChatGPT have demonstrated superior performance on a variety of natural language processing (NLP) tasks including sentiment analysis, mathematical reasoning and summarization. Furthermore, since these models are instruction-tuned on human conversations to produce "helpful" responses, they can and often will produce explanations along with the response, which we call self-explanations. For example, when analyzing the sentiment of a movie review, the model may output not only the positivity of the sentiment, but also an explanation (e.g., by listing the sentiment-laden words such as "fantastic" and "memorable" in the review). How good are these automatically generated self-explanations? In this paper, we investigate this question on the task of sentiment analysis and for feature attribution explanation, one of the most commonly studied settings in the interpretability literature (for pre-ChatGPT models). Specifically, we study different ways to elicit the self-explanations, evaluate their faithfulness on a set of evaluation metrics, and compare them to traditional explanation methods such as occlusion or LIME saliency maps. Through an extensive set of experiments, we find that ChatGPT's self-explanations perform on par with traditional ones, but are quite different from them according to various agreement metrics, meanwhile being much cheaper to produce (as they are generated along with the prediction). In addition, we identified several interesting characteristics of them, which prompt us to rethink many current model interpretability practices in the era of ChatGPT(-like) LLMs. https://arxiv.org/abs/2310.11207 submitted by /u/zyl1024 [link] [comments]  ( 9 min )
    [Discussion] Machine Learning for Mechanical Engineering
    Hello all, I'm a mechanical engineer learning machine learning. I found many specializations on Coursera by Google, DeepLearning.AI, and IBM, but I really can't tell which of them will be the best fit for me, so I would like to hear your recommendations. I actually got financial aid for the DeepLearning.AI specialization and finished the first course, but I'm not satisfied; I feel like this course will not make me a professional. My goal is to master data analysis and ML to work as a freelancer and increase my chances of finding a funded master's degree. submitted by /u/Mobile_Ad_4573 [link] [comments]  ( 9 min )
    [Discussion] Scientific and Data-Intensive Computing study plan
    Hi everyone! I'm a graduate student in Scientific and Data-Intensive Computing at the University of Trieste (Italy) and I'm writing this post because I want to ask you a feedback about my study plan :) 1st semester 2nd semester 3rd semester 4th semester Statistical methods Deep Learning Simulation Intelligence and Learning for Autonomous Systems Parallel Programming for High-Performance Computing High-Performance Computing Advanced Algorithms for Scientific Computing Advanced Topics in Scientific Computing Cloud Computing Advanced Numerical Analysis Advanced Deep Learning Software Development Practices Advanced High-Performance Computing Numerical Analysis Probabilistic Machine Learning Thesis Thesis You can find all the programs of the courses on this website On the following websites, you can find a lot of courses that I could add to my study plan Scientific Computing Courses Data science courses About me I have a Bachelor's degree in Computer Science (University of Rome) I am a Research Intern at an AI startup I will do a Summer Research Internship in the field of (HPC) ∩ (Machine Learning) I don't already know what my thesis will be about but I'm really interested in High-Performance Computing, Computational Mathematics, Machine Learning, and Simulations I would like to work in a research context; I'm considering doing a PhD in Scientific Computing (In that case, I would try to apply to American Universities) I'm available for further clarification :) Thank you in advance submitted by /u/PragmaticScientist [link] [comments]  ( 9 min )
    [D] Has anybody heard back from NeurIPS financial aid yet?
    Was supposed to be Monday but instead it's rolling submitted by /u/notasketchyperson [link] [comments]  ( 9 min )
    [D] Need advice for medical text processing
    I am working on a research project that involves analysing medical text (patient records) to identify key events. Initially I was planning to use chatgpt api and then compare its performance with open source LLMs. However, I've just come across Amazon Comprehend Medical, which seems to be specifically designed for what I need. Has anyone tried it? I would expect it to be better than chatgpt + plugins, as it says it was trained with medical language. This also makes me wonder if there are opensource LLMs specifically trained for the medical field. Does anyone have experience with this? submitted by /u/kiukamba [link] [comments]  ( 9 min )
    [Project] Scaling Llama2 70B with Multi NVIDIA and AMD GPUs under 3k budget
    Big LLMs are memory bound; one way to break that limit is to use multiple GPUs. Recent development in the MLC LLM project makes it possible to compile and deploy large language models on multi-GPU systems, with high-performance support for both NVIDIA and AMD GPUs. Specifically, it can run 4-bit quantized Llama2-70B at 34.5 tok/sec on two NVIDIA RTX 4090s and 29.9 tok/sec on two AMD Radeon 7900XTXs. This is a first solution that helps us scale 70B models across multiple GPUs, bringing the potential to run even larger open LLMs on a reasonable budget (the two AMD GPUs cost 2k). - Project https://github.com/mlc-ai/mlc-llm - Blogpost https://blog.mlc.ai/2023/10/19/Scalable-Language-Model-Inference-on-Multiple-NVDIA-AMD-GPUs submitted by /u/crowwork [link] [comments]  ( 9 min )
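    A rough back-of-the-envelope check of why two consumer cards can hold the 4-bit model. This is only a sketch: it counts weight memory alone (KV cache, activations, and quantization metadata are ignored), and the 24 GB per-card capacity and the even two-way split are assumptions made for illustration, not figures from the post.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 70e9           # Llama2-70B, as named in the post
GPU_MEMORY_GB = 24.0      # assumed capacity of each card

fp16_gb = weight_memory_gb(N_PARAMS, 16)   # unquantized half-precision weights
int4_gb = weight_memory_gb(N_PARAMS, 4)    # 4-bit quantized weights
per_gpu_gb = int4_gb / 2                   # assumed even split across two GPUs

print(f"fp16 weights:  {fp16_gb:.1f} GB")
print(f"4-bit weights: {int4_gb:.1f} GB total, {per_gpu_gb:.1f} GB per GPU")
print(f"fits on one card:  {int4_gb <= GPU_MEMORY_GB}")
print(f"fits on two cards: {per_gpu_gb <= GPU_MEMORY_GB}")
```

    Under these assumptions the 4-bit weights do not fit on a single 24 GB card but do fit once split in two, which is consistent with the multi-GPU setup described above.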
    [D] Is there a way to get word-level timestamps with whisper (using DTW based alignment) without having to host your own model?
    I don't understand how this isn't talked about more, given how many projects/products I've seen that have word-level timestamps with whisper. I understand whisper isn't a traditional CTC model like wav2vec, and I understand that there are plenty of tutorials out there for doing DTW-based alignment. I know whisper-timestamped exists, and whisperx. The thing is, all these solutions assume you have the infrastructure to host your own whisper model. I am just getting started on my product, and I simply don't see the point in paying over 300/mo for a g4 instance (the cheapest GPU instance in AWS) just for an MVP. Has anyone been able to take the whisper API output, and align that using the sound bites and get timestamps? Is running your own whisper model the only way? Thank you! submitted by /u/latent_space_tennis [link] [comments]  ( 9 min )
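    For readers unfamiliar with the DTW step mentioned above, here is a minimal sketch of dynamic time warping over a token-by-frame cost matrix. The matrix below is a toy; where a real cost matrix would come from (and whether a hosted API exposes enough to build one) is outside this sketch. The names `dtw_path` and `token_spans` and the 0.02 s frame duration are assumptions for illustration.

```python
def dtw_path(cost):
    """Return (total_cost, path) for the cheapest monotonic alignment.

    cost[i][j] is the cost of matching token i to audio frame j.
    """
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    # acc[i][j] = cheapest cost of aligning the first i tokens to the first j frames
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i][j] = cost[i - 1][j - 1] + min(
                acc[i - 1][j - 1], acc[i - 1][j], acc[i][j - 1]
            )
    # Backtrack from the end, always stepping to the cheapest predecessor.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (acc[i - 1][j - 1], i - 1, j - 1),
            (acc[i - 1][j], i - 1, j),
            (acc[i][j - 1], i, j - 1),
        )
    path.reverse()
    return acc[n][m], path


def token_spans(path, frame_seconds=0.02):
    """Collapse an alignment path into {token: (start_s, end_s)}; frame length is assumed."""
    spans = {}
    for token, frame in path:
        first, last = spans.get(token, (frame, frame))
        spans[token] = (min(first, frame), max(last, frame))
    return {t: (a * frame_seconds, (b + 1) * frame_seconds) for t, (a, b) in spans.items()}


# Toy example: 2 tokens, 4 frames; low cost means "this frame matches this token".
toy_cost = [
    [0, 0, 1, 1],
    [1, 1, 0, 0],
]
total, path = dtw_path(toy_cost)
print(total, path)
print(token_spans(path))
```

    The alignment itself is cheap pure-CPU work; the expensive part in practice is producing the cost matrix.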
    [D] what metrics do you use to track GPU performance during training and/or inference?
    Hello people! I used to rely on GPU Usage from a Grafana dashboard to track how effectively I was able to leverage the GPU or cluster provided by my company. However, yesterday I saw someone on X/Twitter saying: "Utilization is a poor metric by itself. You can easily hit 100% where the GPU is doing a lot of waiting. Power consumption is a better (but not perfect) measure. If you're burning watts it's usually doing something useful. High util, no watts is not good." That is something I had never considered before! Now I'm quite curious to hear if anyone here has considered this approach before, or uses alternative ways to measure the performance of the GPU resource/cluster. submitted by /u/pirate7777777 [link] [comments]  ( 9 min )
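    The "high util, no watts" heuristic quoted above can be sketched as a small check on top of `nvidia-smi` query output. The thresholds (90% utilization, 50% of the power limit) and the helper names are assumptions chosen for illustration, not recommended values; the parser also assumes every field is numeric, which may not hold on GPUs that report `[N/A]` for power.

```python
import subprocess

QUERY_FIELDS = "utilization.gpu,power.draw,power.limit"


def parse_sample(line):
    """Parse one CSV line such as '100, 95.20, 450.00' into a dict of floats."""
    util, draw, limit = (float(field) for field in line.split(","))
    return {"util_pct": util, "power_w": draw, "power_limit_w": limit}


def looks_stalled(sample, util_min=90.0, power_frac_max=0.5):
    """Flag 'busy but not burning watts': high utilization with low relative power draw."""
    power_frac = sample["power_w"] / sample["power_limit_w"]
    return sample["util_pct"] >= util_min and power_frac <= power_frac_max


def read_gpus():
    """Query all local GPUs once; requires nvidia-smi on PATH."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [parse_sample(line) for line in out.strip().splitlines()]


# Offline example with made-up readings (no GPU needed):
busy_but_idle = parse_sample("100, 95.20, 450.00")
busy_and_hot = parse_sample("100, 430.00, 450.00")
print(looks_stalled(busy_but_idle), looks_stalled(busy_and_hot))
```

    Logging both numbers side by side over a training run is probably more informative than any single threshold.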
    [R] Create 3d model of face with 4 normal images
    Hi guys, I'm looking for an AI application or way to create this in < 10' with proper accuracy. Does anybody know anything? Quality should be good enough to print it. submitted by /u/Reasonable_Cream_520 [link] [comments]  ( 9 min )
    [P] Higgsfield: Distributed LLM training and cluster management framework
    https://github.com/higgsfield-ai/higgsfield submitted by /u/Good-Willingness-985 [link] [comments]  ( 8 min )
    [D] A clear visual and intuitive explanation of Neural Attention
    Hello guys, I made a video for my YT channel breaking down Neural Attention with some intuitive examples and representative projects. Here is the link for those interested, all feedback is appreciated! https://youtu.be/frosrL1CEhw?si=NKTqmRTieVkfCNlb ​ submitted by /u/AvvYaa [link] [comments]  ( 9 min )
    [D] Advantage of VAE's compared to regularized AE's
    I'm trying to come up to speed on VAE's. My intuitive concept of a VAE is an AE for which we want to enforce some distributional regularity on the latent encodings. Why not accomplish this by simply regularizing the latent encodings directly? For example, we could assert that the latent vectors are drawn from a zero-mean, identity-matrix-covariance Gaussian distribution. So that e.g. the loss function becomes: Loss(X) = ReconstructionLoss(Decoder(Encoder(X))) + LogPriorProbability(Encoder(X)) In a variant of this, we could add a hyperparameter coefficient for the prior loss component. Here, there is no "reparameterization trick" because the encoder is not stochastic. We simply regularize the latent encodings directly. If the encoder does not make the data X distribution look like the targeted Gaussian, it's a "less good" encoder. In principle we ought to still be able to generate X's by sampling from the prior and passing it through the decoder. This seems (to me) like the simplest way to regularize the latent space. Why do VAE's, by contrast, introduce the new machinery of a stochastic encoder? submitted by /u/OneQuadrillionOwls [link] [comments]  ( 9 min )
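    The objective proposed in the post can be written out as a small sketch. Two assumptions are made here: the reconstruction loss is taken to be squared error, and the prior term is read as the negative log-density of the latent under a standard normal (a loss to be minimized needs the negative of the log prior probability). The `beta` coefficient is the optional hyperparameter the post mentions.

```python
import math


def neg_log_std_normal(z):
    """Negative log-density of z under a zero-mean, identity-covariance Gaussian."""
    return 0.5 * sum(v * v for v in z) + 0.5 * len(z) * math.log(2 * math.pi)


def regularized_ae_loss(x, x_hat, z, beta=1.0):
    """Squared-error reconstruction plus a weighted prior penalty on the latent code.

    x      : input vector
    x_hat  : Decoder(Encoder(x))
    z      : Encoder(x), deterministic (no sampling, no reparameterization)
    """
    reconstruction = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return reconstruction + beta * neg_log_std_normal(z)


x = [1.0, 2.0, 3.0]
print(regularized_ae_loss(x, x_hat=[1.0, 2.0, 3.0], z=[0.0, 0.0]))
print(regularized_ae_loss(x, x_hat=[1.0, 2.0, 3.0], z=[3.0, -3.0]))
```

    Written this way, the prior term is an L2 penalty on each latent vector plus a constant, so it pulls individual codes toward the origin rather than directly constraining the distribution of codes across the dataset.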
    [D] Run AI Model. Multiple k80 vs RTX 4090?
    I want to build a machine to run multiple types of AI models, like picture generation, chatbots, summarization, etc. I also want to train my own models. Is it better to use multiple (6/7) K80s or something like that, or to buy an RTX 4090? submitted by /u/ilkap2005 [link] [comments]  ( 9 min )
    [D] MLOps Tool for Hyperparametertuning, Distributed Training, etc
    Currently I train many AI models directly in my JupyterLab notebooks and do things like hyperparameter tuning and evaluation of losses/accuracy directly in the notebook using lists and matplotlib. I want to finally switch to an MLOps web UI and have discovered tools like ClearML and Determined.AI. Each of these GUIs has certain advantages/disadvantages for me, so I would like to hear from the community how you do it, which tools you use, whether you do it alone or in a team, and what your workflow is. Until now I often had the impression that you develop your Jupyter notebook normally, then add a few lines of code for the respective tool and then continue in the UI. What I lack is an understanding of how I then jump from the MLOps UI back into the notebook, and how I keep them in sync if I want to change something fundamental in the code again. Thanks in advance submitted by /u/Sensitive_Limit1620 [link] [comments]  ( 9 min )
    [R] Jointly Training Large Autoregressive Multimodal Models https://arxiv.org/abs/2309.15564
    In recent years, advances in the large-scale pretraining of language and text-to-image models have revolutionized the field of machine learning. Yet, integrating these two modalities into a single, robust model capable of generating seamless multimodal outputs remains a significant challenge. To address this gap, we present the Joint Autoregressive Mixture (JAM) framework, a modular approach that systematically fuses existing text and image generation models. We also introduce a specialized, data-efficient instruction-tuning strategy, tailored for mixed-modal generation tasks. Our final instruct-tuned model demonstrates unparalleled performance in generating high-quality multimodal outputs and represents the first model explicitly designed for this purpose. ​ https://arxiv.org/abs/2309.15564 What do you think about this work? Seems pretty huge, they build the first pure autoregressive interleaved text and image generator. Please let me know your opinion on this. Paper by Meta AI. submitted by /u/Present_Chicken5393 [link] [comments]  ( 9 min )
    [R] Curve your Enthusiasm: Concurvity Regularization in Differentiable Generalized Additive Models
    Accepted at NeurIPS 2023 Link: https://arxiv.org/abs/2305.11475 Authors: Julien Siems, Konstantin Ditschuneit, Winfried Ripken, Alma Lindborg, Maximilian Schambach, Johannes Otterbach, Martin Genzel *equal contribution Abstract: Generalized Additive Models (GAMs) have recently experienced a resurgence in popularity due to their interpretability, which arises from expressing the target value as a sum of non-linear transformations of the features. Despite the current enthusiasm for GAMs, their susceptibility to concurvity - i.e., (possibly non-linear) dependencies between the features - has hitherto been largely overlooked. Here, we demonstrate how concurvity can severely impair the interpretability of GAMs and propose a remedy: a conceptually simple, yet effective regularizer which penalizes pairwise correlations of the non-linearly transformed feature variables. This procedure is applicable to any differentiable additive model, such as Neural Additive Models or NeuralProphet, and enhances interpretability by eliminating ambiguities due to self-canceling feature contributions. We validate the effectiveness of our regularizer in experiments on synthetic as well as real-world datasets for time-series and tabular data. Our experiments show that concurvity in GAMs can be reduced without significantly compromising prediction quality, improving interpretability and reducing variance in the feature importances. Keywords: Interpretable Machine Learning, Generalized Additive Models, Concurvity, Multicollinearity, Regularization, Time-Series Forecasting, Interpretability submitted by /u/Yossarian_1234 [link] [comments]  ( 9 min )
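    To make "penalizes pairwise correlations of the non-linearly transformed feature variables" concrete, here is a minimal sketch of such a penalty. It is an illustration based only on the abstract's wording, not the paper's exact formula: the mean of absolute Pearson correlations is an assumed choice, and the function names are hypothetical. A real implementation would compute this on a batch inside an autodiff framework so the penalty can be added to the training loss.

```python
import math
from itertools import combinations


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def concurvity_penalty(contributions):
    """Mean absolute pairwise correlation between per-feature outputs f_i(x_i).

    contributions: list of sequences, one per feature, each holding that
    feature's transformed values over the same batch of samples.
    """
    pairs = list(combinations(contributions, 2))
    return sum(abs(pearson(a, b)) for a, b in pairs) / len(pairs)


# Two shape functions that cancel each other out versus two uncorrelated ones.
self_canceling = [[1.0, 2.0, 3.0, 4.0], [-1.0, -2.0, -3.0, -4.0]]
uncorrelated = [[1.0, 2.0, 3.0, 4.0], [1.0, -1.0, -1.0, 1.0]]
print(concurvity_penalty(self_canceling), concurvity_penalty(uncorrelated))
```

    The self-canceling pair gets the maximum penalty even though the two contributions sum to zero, which is the kind of ambiguity the abstract says the regularizer is meant to remove.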
    [P] Strategic Game Datasets for Enhancing AI planning: An invitation for collaborative research
    Large dataset release of strategic gameplay from LAION https://laion.ai/blog/strategic-game-dataset/ Dataset Overview Chess The chess dataset comprises 3.2 billion games, equating to approximately 608 billion individual moves. These games, generated via self-play by the Stockfish engine, emulate a high strategic complexity, reflective of a 2500 Elo rating. Each entry contains detailed move sequences, termination status, and game results. Rubik's Cube (3x3x3) The rubik's cube dataset features 1.64 billion Rubik's Cube solves, totaling roughly 236.39 billion moves. It provides initial scrambled states and the ensuing solve sequences, offering a complex problem-solving scenario for models to navigate. Mazes The maze dataset, while smaller at 350,000 mazes, represents over 39.29 billion moves. Each maze is a 30x30 ASCII representation, with solutions derived using the A* algorithm, challenging pathfinding and planning algorithms. submitted by /u/hardmaru [link] [comments]  ( 9 min )
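    A quick arithmetic check on the sizes quoted above: dividing total moves by the number of items gives the implied average length per game, solve, or maze. The figures are copied from the post; only the division is new.

```python
# (number of items, total moves), as quoted in the post
datasets = {
    "chess": (3.2e9, 608e9),
    "rubiks_cube": (1.64e9, 236.39e9),
    "mazes": (350_000, 39.29e9),
}

avg_moves = {name: moves / items for name, (items, moves) in datasets.items()}

for name, avg in avg_moves.items():
    print(f"{name}: about {avg:,.1f} moves per item")
```

    The chess and Rubik's Cube averages come out in the low hundreds. The maze figures imply over a hundred thousand moves per maze, far more than the 900 cells of a 30x30 grid, so "moves" may be counted differently there or one of the quoted numbers may be off; this is worth checking against the dataset card before relying on it.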
    [R] Set-of-Mark (SoM) Unleashes Extraordinary Visual Grounding in GPT-4V
    We are introducing a magic Set-of-Mark (SoM) prompting for GPT-4V! Simply overlaying a set of marks on the image immediately unleashes the visual grounding power of GPT-4V! Left: GPT-4V Default Right: GPT-4V + SoM Many people including myself have been impressed by the general intelligence to understand images, but also questioning its visual grounding capability. After spending the last week or two, I am really shocked by the power of GPT-4V after plugging our SoM prompting. It can not only do a lot of fine-grained vision tasks but also can perform visual reasoning and project its world knowledge to the visual inputs! To extract meaningful regions, we compiled a new SoM toolbox with a number of interactive image segmentation tools, like our own MaskDINO, SEEM, Semantic-SAM, and also SAM…  ( 10 min )
    [R] Mamba: Linear-Time Sequence Modeling with Selective State Spaces
    submitted by /u/LABTUD [link] [comments]  ( 9 min )
  • Open

    Announcing Rekognition Custom Moderation: Enhance accuracy of pre-trained Rekognition moderation models with your data
    Companies increasingly rely on user-generated images and videos for engagement. From ecommerce platforms encouraging customers to share product images to social media companies promoting user-generated videos and images, using user content for engagement is a powerful strategy. However, it can be challenging to ensure that this user-generated content is consistent with your policies and fosters […]  ( 7 min )
    Defect detection in high-resolution imagery using two-stage Amazon Rekognition Custom Labels models
    High-resolution imagery is very prevalent in today’s world, from satellite imagery to drones and DSLR cameras. From this imagery, we can capture damage due to natural disasters, anomalies in manufacturing equipment, or very small defects such as defects on printed circuit boards (PCBs) or semiconductors. Building anomaly detection models using high-resolution imagery can be challenging […]  ( 8 min )
    Automatically redact PII for machine learning using Amazon SageMaker Data Wrangler
    Customers increasingly want to use deep learning approaches such as large language models (LLMs) to automate the extraction of data and insights. For many industries, data that is useful for machine learning (ML) may contain personally identifiable information (PII). To ensure customer privacy and maintain regulatory compliance while training, fine-tuning, and using deep learning models, […]  ( 12 min )
  • Open

    Next-Level Computing: NVIDIA and AMD Deliver Powerful Workstations to Accelerate AI, Rendering and Simulation
    To enable professionals worldwide to build and run AI applications right from their desktops, NVIDIA and AMD are powering a new line of workstations equipped with NVIDIA RTX Ada Generation GPUs and AMD Ryzen Threadripper PRO 7000 WX-Series CPUs. Bringing together the highest levels of AI computing, rendering and simulation capabilities, these new platforms enable Read article >  ( 5 min )
    NVIDIA AI Now Available in Oracle Cloud Marketplace
    Training generative AI models just got easier. NVIDIA DGX Cloud AI supercomputing platform and NVIDIA AI Enterprise software are now available in Oracle Cloud Marketplace, making it possible for Oracle Cloud Infrastructure customers to access high-performance accelerated computing and software to run secure, stable and supported production AI in just a few clicks. The addition Read article >  ( 6 min )
    Coming in Clutch: Stream ‘Counter-Strike 2’ From the Cloud for Highest Frame Rates
    Rush to the cloud — stream Counter-Strike 2 on GeForce NOW for the highest frame rates. Members can play through the newest chapter of Valve’s elite, competitive, first-person shooter from the cloud. It’s all part of an action-packed GFN Thursday, with 22 more games joining the cloud gaming platform’s library, including Hot Wheels Unleashed 2 Read article >  ( 5 min )
  • Open

    DreamerV2 stochastic decoders
    Hello, I am implementing the code for the paper DreamerV2, and there are some things that look a bit strange to me. The predictors and, in particular, the image and the reward predictors are stochastic and they output Normal distributions. Both the normal distributions have the mean, which is the output of the respective models, and the variance is 1. Usually, in RL we normalize observations and rewards to be between 0 and 1, and in such a case I don't know if it's reasonable to sample from a Gaussian with variance one. I don't know about the specific preprocessing done in DreamerV2, except in the paper DreamerV1, where in section 6 (Control tasks), they say that the reward ranges from 0 to 1. Do you know what are the advantages of using a stochastic decoder and when to use it? submitted by /u/ZioFranco1404 [link] [comments]
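    One way to read the unit-variance choice, assuming the predictors are trained by maximizing log-likelihood (the post does not state the training objective, so this is an assumption): the negative log-likelihood of a Normal with variance 1 is half the squared error plus a constant, so the distribution acts as a probabilistic wrapper around a mean-squared-error loss. A small sketch:

```python
import math


def unit_gaussian_nll(target, mean):
    """Negative log-likelihood of `target` under Normal(mean, variance=1)."""
    return 0.5 * (target - mean) ** 2 + 0.5 * math.log(2 * math.pi)


def half_squared_error(target, mean):
    return 0.5 * (target - mean) ** 2


target = 0.7                      # e.g. a reward already scaled to [0, 1]
for prediction in (0.7, 0.5, 0.0):
    nll = unit_gaussian_nll(target, prediction)
    mse = half_squared_error(target, prediction)
    # The gap between the two is the same constant for every prediction.
    print(f"pred={prediction:.1f}  nll={nll:.4f}  0.5*se={mse:.4f}  gap={nll - mse:.4f}")
```

    Under this reading, the variance only matters if the prediction is actually sampled; using the mean of the distribution sidesteps the concern about noise that is large relative to a [0, 1] range.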
    Reinforcement learning on steam games
    Does anyone have any idea how to get game details such as character movements and environment information through API calls? I want to use them for my reinforcement learning. submitted by /u/Important_Ad_55 [link] [comments]
  • Open

    English learners can now practice speaking on Search
    Posted by Christian Plagemann, Director, and Katya Cox, Product Manager, Google Research Learning a language can open up new opportunities in a person’s life. It can help people connect with those from different cultures, travel the world, and advance their career. English alone is estimated to have 1.5 billion learners worldwide. Yet proficiency in a new language is difficult to achieve, and many learners cite a lack of opportunity to practice speaking actively and receiving actionable feedback as a barrier to learning. We are excited to announce a new feature of Google Search that helps people practice speaking and improve their language skills. Within the next few days, Android users in Argentina, Colombia, India (Hindi), Indonesia, Mexico, and Venezuela can get even more langua…  ( 94 min )
  • Open

    What’s Your Story: Ranveer Chandra
    In this new Microsoft Research Podcast series What’s Your Story, Lab Director Johannes Gehrke explores the who behind the technical and scientific advancements helping to reshape the world. He talks to members of the research community at Microsoft about what motivates their work and how they got where they are today.  Ranveer Chandra is Managing […] The post What’s Your Story: Ranveer Chandra appeared first on Microsoft Research.  ( 31 min )
  • Open

    DALL·E 3 is now available in ChatGPT Plus and Enterprise
    We developed a safety mitigation stack to ready DALL·E 3 for wider release and are sharing updates on our provenance research.  ( 3 min )
  • Open

    To excel at engineering design, generative AI must learn to innovate, study finds
    AI models that prioritize similarity falter when asked to design something completely new.  ( 10 min )
  • Open

    Bluesky
    I saw a comment from Christos Argyropoulos on Twitter implying that there’s a good scientific community on Bluesky, so I went there and looked around a little bit. I have an account, but I haven’t done much with it. I was surprised that a fair number of people had followed me on Bluesky even though I […] Bluesky first appeared on John D. Cook.  ( 5 min )

  • Open

    I finally have enough ai tools and here is my complete list
    YouTube Tools: Eightify, Steve AI, Glasp, ClipMaker, TubeBuddy, Thumbly. Sales Tools: Lavendar, Warmer, Octane, Twain, Regie, Simplified. Productivity Tools: Bardeen AI, Paperpal, Consensus AI, Writesonic, ChartGPT, Scholarcy. Music Tools: Muzeek, Brain FM, Amper, Melodrive, Jukedeck, Boomy. Writing Tools: AISEO, Quillbot, Simplified, Writesonic, Bertha AI, Jasper AI. Coding Tools: 10WEB, Durable AI, Deepcode, Akkio, Replit, GitHub Copilot. Chatbots Tools: Yatterplus, Typewise, Quickchat, Cohere, Kaizan, GPTBuddy. Daily life Tools: Notion AI, Taskade, TLVD, Vondy AI, Bardeen AI, Eessel. Content Creation Tools: Writesonic, Tome AI, Beautiful AI, ChartGPT, ChatABC, Steve AI. Twitter Tools: Postwise, Tweet Hunter, TribeScaler, Tweetlify, Tweetmonk, Hypefury. Images Tools: StockIMG, Mid Journey, Leonardo AI, Bing AI, Autodraw, Microsoft Designer. Chrome Extensions: Alicent, Compose AI, Poised AI, Voila AI, Wiseone. I'm just sharing my experiences and observations in the field of AI. LIST AND SITE submitted by /u/PerceptionPlayful469 [link] [comments]  ( 9 min )
    How to use AI being a teacher
    Hello guys, I'm an English student and I have been teaching my teacher how to use ChatGPT and the wide variety of AI tools in the classroom and in her job. She told me that I changed her life by showing her these things, and I have other teachers asking me how they can use this technology for their jobs. So I have a question for you guys: do you have any ideas about how a teacher can use these tools? Maybe you have some experiences or ideas that I've never thought of. submitted by /u/Odd_Solution7099 [link] [comments]  ( 9 min )
    Best AI image generator for B2B SaaS websites?
    Rebuilding a low quality B2B SaaS product site and I'd prefer to use an AI image generator that will produce high quality unique images for each of the sections on our website that are consistent with our brand and generated to match the copy the image is supporting. Output of the image should work for a responsive web design. Anything out there that does this? submitted by /u/DumpTrumpGrump [link] [comments]  ( 9 min )
    Is there an AI site or app that can change the instrument in each stem track of a song?
    Any help would be appreciated. submitted by /u/J97051 [link] [comments]  ( 9 min )
    Meta Announces New Method for Real-Time Decoding of Images from Brain Activity
    Brain decoding tech has improved a lot recently thanks to AI/ML, enabling reading out visual perceptions from fMRI brain scans. But fMRI is too slow for real-time BCIs. A new study from Meta's AI research team pushes brain reading into real-time using MEG, which measures whole-brain activity at super-fast millisecond resolution. They built a 3-part pipeline to decode MEG signals: Embed images into latent spaces using pretrained models like CLIP. Train MEG-specific ConvNet to predict embeddings from MEG data. Generate images from MEG embeddings with diffusion model. They tested it on 20k+ natural images. MEG decoding was 7X better than old methods, hitting 70% top-5 accuracy in retrieving the right images. Generated images matched semantics decently but lacked fine visual details compared to fMRI. MEG seems more focused on high-level category info whereas fMRI captures more low-level features. This could enable visual BCIs for paralysis, etc. ... honestly, a world where we can decode brain images in real time is pretty crazy. The findings also raise some important ethical considerations around privacy of decoded mental content... (wow, that was a weird sentence to write!). TLDR: New MEG pipeline decodes dynamic visual data from brain activity in real-time. Good but not yet photorealistic-quality image generation. Full summary here. Paper is here. submitted by /u/Successful-Western27 [link] [comments]  ( 9 min )
    A 'Godfather of AI' Calls for an Organization to Defend Humanity
    Yoshua Bengio, a pioneer in artificial neural networks and deep learning, calls for an organization to defend humanity against the potential threats of artificial intelligence. He believes that AI could achieve human levels of cognitive competence within a few years or decades, which raises concerns about democracy, national security, and our collective future. Bengio reflects on his own work and the importance of addressing the existential risks posed by AI. He acknowledges that these risks were not taken seriously until recently and discusses the taboo surrounding the topic in the AI research community. Source : https://thebulletin.org/2023/10/ai-godfather-yoshua-bengio-we-need-a-humanity-defense-organization/ submitted by /u/NuseAI [link] [comments]  ( 9 min )
    Tutorial: Benchmarking Bark text-to-speech on 26 Nvidia GPUs - Reading out 144K recipes
    In this project, we benchmarked Bark text-to-speech across 26 different consumer GPUs. The goal: To get Bark to read 144K food recipes from Food.com's recipe dataset. You can read the full tutorial here: https://blog.salad.com/bark-benchmark-text-to-speech/ Included: Architecture diagram, data preparation, inference server setup, queue worker, setting up container group & compiling the results. Code blocks are included in the tutorial. Words per dollar for each GPU: (chart: words-per-dollar comparison for each GPU) Although the latest cards are indeed much faster than older cards at performing the inference, there’s really a sweet spot for cost-performance in the lower end 30xx series cards. Conclusions: As is often the case, there’s a clear trade-off here between cost and performance. Higher end cards are faster, but their disproportionate cost makes them more expensive per word spoken. The model’s median speed is surprisingly similar across GPU types, even though the peak performance can be quite different. No matter what GPU you select, you should be prepared for significant variability in performance. Qualitative: While Bark’s speech is often impressively natural sounding, it does have a tendency to go off script sometimes. We’ve also made available audio from 1000 top-rated recipes, paired with the script it was trying to read. submitted by /u/SaladChefs [link] [comments]  ( 9 min )
    I took the whole of Massive Attack's 'Safe From Harm' music video and put it through AnimateDiff / ControlNet with a futuristic / robot prompt.
    submitted by /u/glenniszen [link] [comments]  ( 9 min )
    Inflection AI’s Pi has to be the dumbest ‘corporate’ LLM and only model to not improve since day one.
    I remember at launch how it was telling everyone it was based on OpenAI’s GPT-3 architecture, and now it’s still hallucinating just as much, referring to itself as ‘Bing Chat’ and providing fake links even though it now has access to the internet. I actually don’t understand how you can be such a large company and make no improvements in 6 months, which is an eternity in AI. submitted by /u/sardoa11 [link] [comments]  ( 9 min )
    Researchers Just Found Something Terrifying About Talking to AI Chatbots
    New research suggests that AI chatbots can infer personal information about users based on minor context clues. The large language models (LLMs) behind chatbots like OpenAI's ChatGPT and Google's Bard are trained on publicly-available data, which can be used to identify sensitive information about someone. The research found that OpenAI's GPT-4 was able to correctly predict private information about users 85 to 95 percent of the time. For example, the LLM correctly identified that a user was based in Melbourne, Australia based on a mention of the term 'hook turn,' which is a traffic maneuver specific to Melbourne. The research also suggests that chatbots could potentially infer a user's race based on offhanded comments. This raises concerns about internet privacy and the potential misuse of personal data by advertisers or hackers. Source : https://futurism.com/the-byte/ai-chatbot-privacy-inference submitted by /u/NuseAI [link] [comments]  ( 9 min )
    Anime, AI & Censorship
    Is there an AI tool that can go over anime episodes/films and turn China's white anime censorship back to red? Possibly by segmenting the blood🩸 frame by frame. submitted by /u/Phantasius224 [link] [comments]  ( 9 min )
    GPT 4 DUDE MAKING REFLEXIONS IN SVG WHAT....WOW
    submitted by /u/the_anonymizer [link] [comments]  ( 8 min )
    One-Minute Daily AI News 10/17/2023
    NVIDIA NeMo SteerLM lets companies define knobs to dial in a model’s responses as it’s running in production, a process called inference. Unlike current methods for customizing an LLM, it lets a single training run create one model that can serve dozens or even hundreds of use cases, saving time and money.[1] According to an official release, Dell Technologies held a “Bringing AI to data” Asia Pacific and Japan (APJ) media briefing this week.[2] Baidu Says Its AI as Good as ChatGPT in Big Claim for China.[3] Roman scrolls were illegible for 2,000 years. A college student read one with AI.[4] How often do you think about the Roman Empire? Sources: [1] https://blogs.nvidia.com/blog/2023/10/11/customize-ai-models-steerlm/ [2] https://www.financialexpress.com/business/digital-transformation-dell-technologies-to-expand-its-ai-services-3274790/ [3] https://www.bloomberg.com/news/articles/2023-10-17/baidu-says-its-ai-as-good-as-chatgpt-s-in-bold-claim-for-china?embedded-checkout=true [4] https://www.washingtonpost.com/nation/2023/10/17/herculaneum-scrolls-contest-translated-deciphered/ submitted by /u/Excellent-Target-847 [link] [comments]  ( 9 min )
  • Open

    MuJoCo with OpenAI gym
    Hello, I'm trying to use OpenAI's Spinning Up to learn about RL. Spinning Up requires OpenAI gym instead of the new gymnasium package. Trying to install MuJoCo with gym, I'm getting an error that I'm missing a MuJoCo license key. But MuJoCo is free now, right? So what is the status of backward compatibility? Is there some global license key that can be used? Or is it simply not backward compatible? Thanks a lot. submitted by /u/mega_monkey_mind [link] [comments]  ( 9 min )
    DQN in a non-Markovian environment
    Hello there, I am working on a school project in which we want to implement an RL algorithm on a simple problem. The goal is to maximise the heart rate of a person using a vibrator by setting its frequency. We wrote a simulator that outputs the new heart rate based on the vibration frequency. It implements several different classes of users: for example, one whose heart rate increases when the vibration frequency stays the same, another who prefers it to increase over time, etc. We determined that the state needs to contain the current heart rate as well as a table of the k previous heart rates and their associated actions. Without that memory, we would not be able to tell the different profiles apart, because in the same state we would need different actions to satisfy each of them. We then have a correlation between previous samples and the action we take in the current state, which I have read makes the problem non-Markovian. Is there a way to solve this problem using a DQN algorithm, given that we need to memorize the previous samples in order, which seems to go against the algorithm's behavior and its use of a replay memory? Are there better-suited algorithms? submitted by /u/Outrageous-Subject38 [link] [comments]  ( 9 min )
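    A common way to handle the setup described above is to fold the k previous (heart rate, action) pairs into the state itself, similar to frame stacking. Each stored transition then carries its own history, and a replay memory can sample transitions in any order. The following is a minimal sketch under that assumption; the class name `HistoryState`, the padding scheme, and the example values are hypothetical and not part of the poster's simulator.

```python
from collections import deque
from typing import Deque, List, Tuple


class HistoryState:
    """Current heart rate plus the k most recent (heart_rate, action) pairs."""

    def __init__(self, k: int, initial_heart_rate: float) -> None:
        self.k = k
        self.current = initial_heart_rate
        # Pad the history so the state vector has a fixed length from step 0.
        padding = [(initial_heart_rate, 0.0)] * k
        self.history: Deque[Tuple[float, float]] = deque(padding, maxlen=k)

    def update(self, action: float, new_heart_rate: float) -> None:
        # Record the heart rate we acted from and the action taken there,
        # then move on to the newly observed heart rate.
        self.history.append((self.current, action))
        self.current = new_heart_rate

    def vector(self) -> List[float]:
        # Flat vector of length 1 + 2k, usable as a DQN network input.
        flat = [self.current]
        for heart_rate, action in self.history:
            flat.extend((heart_rate, action))
        return flat


if __name__ == "__main__":
    state = HistoryState(k=2, initial_heart_rate=60.0)
    state.update(action=5.0, new_heart_rate=62.0)
    state.update(action=6.0, new_heart_rate=65.0)
    print(state.vector())
```

    If a fixed window of k steps turns out not to be enough to tell the user profiles apart, recurrent variants of DQN are the usual next thing to look at.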
    Best Books to Learn Reinforcement Learning
    submitted by /u/Lakshmireddys [link] [comments]  ( 9 min )
    "gp.t: Learning to Learn with Generative Models of Neural Network Checkpoints", Peebles et al 2022
    submitted by /u/gwern [link] [comments]  ( 9 min )
    Autonomous Driving: Ellipsoidal Constrained Agent Navigation | Swaayatt Robots | Motion Planning Research
    submitted by /u/shani_786 [link] [comments]  ( 10 min )
    DQN Agent stuck at local Minima (Probably)
    I'm attempting to address a Day Ahead Electricity Market bidding problem. The concept revolves around purchasing electricity during the lowest price hours and selling it during the highest price hours to maximize profit. I possess 5 years of data featuring variables such as predicted wind speed, predicted temperature, predicted net load, predicted price, and more. I'm employing reinforcement learning and have made attempts to implement Deep Q Learning using the Stable-Baselines3 library. Each episode consists of 24 steps, corresponding to the 24 hours in a day, with each step representing the progression to the next hour. The ultimate objective is to maximize profits by the end of the day. Here are the configuration settings: - Learning rate: 0.0001 - Gamma: 1.0 - Exploration start: 1.…
    6DOF Simulation RL Capability
    I have a 6DOF Simulink model of an autonomous underwater vehicle that has states [u v w p q r x y z phi theta psi] and two inputs [theta1 theta2] that govern the angle of the control surfaces. Ocean current and depth are taken into account. How feasible would it be to use RL to reach waypoints at various [x, y, z] positions? I have a feeling hyperparameter tuning might play a larger role in this. I expect training times to increase exponentially as well. I have done this using a single randomly spawned waypoint with a simple unicycle kinematic model, in both Simulink/MATLAB and Python with a vectorized/parallel environment using SB3/PettingZoo/Gym. submitted by /u/VisionZUS [link] [comments]
    Recommended 'seeding' approach when training/evaluating an experiment
    Dear all, As part of my studies, I am running some RL experiments in which I want to compare some different catastrophic forgetting approaches in sequential task learning. I am using PPO as a baseline. What is the usual experimental setting in relation to seeds used during training and evaluation? If I do for example 3 trainings for a given approach using a different seed for each training, what is the best way of doing the evaluation afterwards? Let's say I have Approach/algorithm A -> train 3 times with 3 seeds -> model_A1, model_A2, model_A3 Then I would like to use 3 different seeds for the evaluation, so to evaluate each of the previously trained models over a set of episodes (deterministic) for each evaluation seed, and get averaged rewards (or median). I wonder whether I might be over complicating things, so I would like to ask you for suggestions. To give a bit of context, this is not intended for a paper, but as part of my master studies, so conditions are a bit more relaxed. Thanks in advance for your insights and suggestions submitted by /u/cotorritaloca80 [link] [comments]
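    The protocol described above (several training seeds per approach, then a shared set of evaluation seeds) can be sketched as follows. This is an illustration under assumptions: `toy_episode` stands in for a real deterministic policy rollout, and each "model" is reduced to a single skill number. Reusing the same evaluation seeds for every model keeps the comparison on identical episodes, and the spread is reported across training seeds.

```python
import random
import statistics
from typing import Callable, Dict, List, Tuple


def toy_episode(skill: float, rng: random.Random) -> float:
    """Hypothetical stand-in for one evaluation episode's return."""
    return skill + rng.gauss(0.0, 1.0)


def evaluate_models(
    models: Dict[str, float],
    eval_seeds: List[int],
    episodes_per_seed: int,
    run_episode: Callable[[float, random.Random], float],
) -> Dict[str, float]:
    """Mean return of each trained model over all evaluation seeds."""
    per_model: Dict[str, float] = {}
    for name, model in models.items():
        returns: List[float] = []
        for seed in eval_seeds:
            # A fresh generator per seed: every model sees the same episodes.
            rng = random.Random(seed)
            returns.extend(run_episode(model, rng) for _ in range(episodes_per_seed))
        per_model[name] = statistics.mean(returns)
    return per_model


def summarize(per_model: Dict[str, float]) -> Tuple[float, float]:
    """Mean and sample standard deviation across training seeds."""
    scores = list(per_model.values())
    return statistics.mean(scores), statistics.stdev(scores)


if __name__ == "__main__":
    approach_a = {"model_A1": 1.0, "model_A2": 1.2, "model_A3": 0.9}
    scores = evaluate_models(approach_a, [100, 101, 102], 5, toy_episode)
    mean, spread = summarize(scores)
    print(f"approach A: {mean:.3f} +/- {spread:.3f} over {len(scores)} training seeds")
```

    With only three training seeds the spread estimate is rough, so reporting the individual per-seed scores alongside the mean is a reasonable complement.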
  • Open

    [D] Combining data transformation and scaling techniques
    I am cleaning a dataset for a (macro-economic) demand forecast, and I'm wondering when one should apply data transformation. When is it recommended to include Box-Cox or Yeo-Johnson, and how should we choose between the two? How does it affect feature selection or model performance? Additionally, how should we select the appropriate scaling technique (normalizing, standardizing, min-max), and does the order in which we transform and scale matter for our data? Is there any recommended literature on this? submitted by /u/Ambitious-Pay6329 [link] [comments]  ( 9 min )
    [D] GPU-compatible SNN-libraries in 2023?
    Hello, I am currently using snnTorch for a video classification task and I achieve fine results; however, the training process is really, really slow. I was hoping to utilize my GPU for this task, and while there seem to be alternatives, I was hoping to see if anyone will vouch for any of these, or a different one: https://github.com/norse/norse https://github.com/BindsNET/bindsnet https://github.com/fangwei123456/spikingjelly https://github.com/UCI-CARL/CARLsim6 My priorities are, in order: 1) Windows support, 2) potential transferability to in-memory compute hardware, 3) PyTorch compatibility. submitted by /u/SlayahhEUW [link] [comments]  ( 9 min )
    [R] Meta AI: Towards a Real-Time Decoding of Images from Brain Activity
    Brain decoding tech has improved a lot recently thanks to AI/ML, enabling reading out visual perceptions from fMRI brain scans. But fMRI is too slow for real-time BCIs. A new study from Meta's AI research team pushes brain reading into real-time using MEG, which measures whole-brain activity at super-fast millisecond resolution. They built a 3-part pipeline to decode MEG signals: Embed images into latent spaces using pretrained models like CLIP. Train MEG-specific ConvNet to predict embeddings from MEG data. Generate images from MEG embeddings with diffusion model. They tested it on 20k+ natural images. MEG decoding was 7X better than old methods, hitting 70% top-5 accuracy in retrieving the right images. Generated images matched semantics decently but lacked fine visual details compared to fMRI. MEG seems more focused on high-level category info whereas fMRI captures more low-level features. This could enable visual BCIs for paralysis, etc. ... honestly, a world where we can decode brain images in real time is pretty crazy. The findings also raise some important ethical considerations around privacy of decoded mental content... (wow, that was a weird sentence to write!). TLDR: New MEG pipeline decodes dynamic visual data from brain activity in real-time. Good but not yet photorealistic-quality image generation. Full summary here. Paper is here. submitted by /u/Successful-Western27 [link] [comments]  ( 9 min )
    [D] Can someone ELI5 the birch clustering algorithm?
    https://scikit-learn.org/stable/modules/generated/sklearn.cluster.Birch.html I'm looking at the parameters here and I'm confused on how there is no distance metric? What is assumed about the data going in if there is no distance metric or precomputed distance option? For example, can I run this with binary data (1/0), what about data w/ missing values? Does it assume the samples are normally distributed? submitted by /u/o-rka [link] [comments]  ( 9 min )
    [R] xVal: A Continuous Number Encoding for Large Language Models - The Polymathic AI Collaboration 2023 - Using the numbers directly instead of tokenizing them increases performance significantly!
    Paper: https://arxiv.org/abs/2310.02989 Twitter discussion: https://x.com/andrew_n_carr/status/1714326003030638848?s=20 Shows in my opinion that tokenizers are clouding the understanding of LLMs and that using the data directly is better. https://x.com/karpathy/status/1657949234535211009?s=20 Karpathy thinks the same! Abstract: Large Language Models have not yet been broadly adapted for the analysis of scientific datasets due in part to the unique difficulties of tokenizing numbers. We propose XVAL, a numerical encoding scheme that represents any real number using just a single token. XVAL represents a given real number by scaling a dedicated embedding vector by the number value. Combined with a modified number-inference approach, this strategy renders the model end-to-end continuous when considered as a map from the numbers of the input string to those of the output string. This leads to an inductive bias that is generally more suitable for applications in scientific domains. We empirically evaluate our proposal on a number of synthetic and real-world datasets. Compared with existing number encoding schemes, we find that XVAL is more token-efficient and demonstrates improved generalization. https://preview.redd.it/qq8u066smzub1.jpg?width=1344&format=pjpg&auto=webp&s=498be8488c00147f0a7443050519dcf535fae126 https://preview.redd.it/dxqd4wpsmzub1.jpg?width=1499&format=pjpg&auto=webp&s=266689a80b31cb31fdc4167043f7abdb4f683100 https://preview.redd.it/0yy93xpsmzub1.jpg?width=1497&format=pjpg&auto=webp&s=b5eae8b958f03afc3c8c85a95c115e48aed1d06e submitted by /u/Singularian2501 [link] [comments]  ( 9 min )
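    The abstract's central step, representing a number by scaling a dedicated embedding vector by the number's value, can be illustrated with a toy example. This is only a sketch of that one idea: the `[NUM]` placeholder name, the whitespace tokenizer, and the two-dimensional embedding table are assumptions made for illustration, and any preprocessing or number-inference head from the paper is left out.

```python
import re
from typing import Dict, List, Tuple

NUM_TOKEN = "[NUM]"  # assumed placeholder name for the single number token
NUMBER_PATTERN = re.compile(r"-?\d+(?:\.\d+)?")


def split_numbers(text: str) -> Tuple[List[str], List[float]]:
    """Replace each number with NUM_TOKEN and keep its value on the side."""
    tokens: List[str] = []
    values: List[float] = []
    for word in text.split():
        if NUMBER_PATTERN.fullmatch(word):
            tokens.append(NUM_TOKEN)
            values.append(float(word))
        else:
            tokens.append(word)
            values.append(1.0)  # non-number tokens keep their embedding as-is
    return tokens, values


def embed(
    tokens: List[str], values: List[float], table: Dict[str, List[float]]
) -> List[List[float]]:
    """Look up each token's vector and scale it by the accompanying value."""
    return [[value * x for x in table[token]] for token, value in zip(tokens, values)]


if __name__ == "__main__":
    table = {"mass": [1.0, 0.0], NUM_TOKEN: [0.5, -1.0], "kg": [0.0, 2.0]}
    tokens, values = split_numbers("mass 2.5 kg")
    print(tokens)
    print(embed(tokens, values, table))
```

    Because every number shares one token, the vocabulary does not grow with the range of values, and nearby numbers map to nearby embeddings, which is the continuity property the abstract refers to.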
    [P] A Guide to Building LLM-Based Applications with Code Llama
    Have you ever wondered how to take advantage of the power of large language models (LLMs) and Generative AI at the edge? Our latest blog, A Guide to Building LLM-Based Applications with Code Llama, shows how you can use Code Llama on an edge device to build a customized dashboard application. The tutorial shows how Code Llama can empower analysts in remote, restricted environments to build applications with minimal connectivity and compute capacity, and walks you through running Code Llama on an edge device in a remote location. submitted by /u/modzykirsten [link] [comments]  ( 9 min )
    [R] LLMs can threaten privacy at scale by inferring personal information from seemingly benign texts
    Our latest research shows an emerging privacy threat from LLMs beyond training data memorization. We investigate how LLMs such as GPT-4 can infer personal information from seemingly benign texts. The key observation of our work is that the best LLMs are almost as accurate as humans, while being at least 100x faster and 240x cheaper in inferring such personal information. We collect and label real Reddit profiles, and test the LLMs capabilities in inferring personal information from mere Reddit posts, where GPT-4 achieves >85% Top-1 accuracy. Mitigations such as anonymization are shown to be largely ineffective in preventing such attacks. Test your own inference skills against GPT-4 and learn more: https://llm-privacy.org/ Arxiv paper: https://arxiv.org/abs/2310.07298 WIRED article: https://www.wired.com/story/ai-chatbots-can-guess-your-personal-information/ submitted by /u/bmislav [link] [comments]  ( 9 min )
    [D] GAN that manipulates shape, texture, color, position, angle
    I remember seeing a paper on manipulating or changing an object's attributes; it came out rather recently and seemed to work really well, but I just can't find it anymore. All I know of is „Counterfactual Generative Networks“ by A. Sauer & A. Geiger (2020). I'd really appreciate it if anyone could share similar work, especially if it is causally motivated. submitted by /u/Glittering_teapot [link] [comments]  ( 9 min )
    [P] Best Way to Create a Custom Chatbot from Personal Data (PDF, etc.)
    Hello fellow Redditors! I am looking for some guidance on creating a custom chatbot using my own data, which is currently in PDF format. I've explored various options like Azure, Pinecone, and I've heard about the AskYourPDF API, but I'm not sure which one would be the best fit for my project. I want to keep things simple, so I'm reaching out to the community to ask for recommendations or advice on the easiest and most effective way to build a website with a personalized chatbot. If you have experience with similar projects or know about user-friendly tools or platforms, please share your insights. I appreciate any suggestions, tips, or pointers you can provide. Thank you in advance for your help! TL;DR: Need advice on the simplest way to create a website with a personalized chatbot using my own data (PDF format). Seeking recommendations and tips from the community. Thank you! submitted by /u/Huge-Number-4299 [link] [comments]  ( 9 min )
    [P] Where do I gather the dataset for my FYP
    I am doing a Machine Learning project for my FYP; I haven't worked on any ML project yet but I am excited about it. It is related to voice/facial emotion detection. Is there any platform that provides datasets for ML projects, ideally without any copyright issues (if that's even a thing with ML datasets, I don't know)? A total beginner here. submitted by /u/fewdiepie_ [link] [comments]  ( 9 min )
    [P] I made a finetune of CodeLlama to resolve merge conflicts!
    I made a finetune of CodeLlama-7b for resolving merge conflicts following up on an IEEE study from 2022. The demo is here if anyone wants to check it out and give some feedback. It would help a ton for future versions improving the dataset and going forward with the 13b and 34b models submitted by /u/codys12 [link] [comments]  ( 9 min )
    [Discussion] how much 'error' should i apply when training with synthetic data?
    Hi there, I'm trying to build a small AI that formats text. Of course the formatting tools already built into IDEs, search engines, MS software, and note-taking apps work well, but this is more for educational purposes and self-interest. Since I don't have an infinite amount of time and money, I'm thinking of taking open-source text data and generating synthetic data by unformatting it with GPT-3.5 or some kind of algorithm. This is the part where I'm stuck: when adding errors such as inappropriate line breaks, tabs, and typos, how many should I add? It would be best if I knew some kind of distribution of the text errors people make in everyday life, but I don't have one. I don't want to make the training too hard, so I'm not planning to destroy the text, but rather to add an appropriate level of errors. But would it help the model learn better if I added extra errors? Or is this all just something I would have to figure out by myself? Any comments would be appreciated! submitted by /u/Strange_Dog8104 [link] [comments]  ( 9 min )
    [R] Open-source video translate solutions
    Hi there! Are there any open-source solutions for video translation? I mean replacing a video's audio stream with a translated one in a different language (in sync with the picture), not necessarily altering mouth movements in the video. submitted by /u/curryprogrammer [link] [comments]  ( 9 min )
    [Research] Literature survey query
    Hi all, First time posting here. I am doing my PhD in Language Conditioned Robotics. I am currently writing a literature review paper on the current state of the field and how it can be further improved. I am covering topics such as generative AI and LLMs in there. I would be more than grateful if you could send some literature review papers in the field of ML so I understand how to structure and write my paper and also what I should focus on more. They don't necessarily have to be related to my PhD topic (but if they are, it will help quite a bit). I would be more than happy if anyone can also share their experience. Thank you for your time! submitted by /u/bizzonkiller [link] [comments]  ( 9 min )
    [D] What are some of the best library frameworks to use for speech2text and text2speech AI chatbot
    Hey guys, what are some of the best libraries to use to make a voice conversational AI chatbot? I googled around and found Vocode. It looks pretty good. However, Vocode relies on several other (paid) closed-source libraries such as Deepgram (for transcribing) and Azure AI Speech (for synthesising). Are there any other libraries/frameworks out there which are completely or more open source? submitted by /u/redd-dev [link] [comments]  ( 9 min )
    [R] EFFICIENT STREAMING LANGUAGE MODELS WITH ATTENTION SINKS
    submitted by /u/Username912773 [link] [comments]  ( 9 min )
    6DOF Sim RL Capability [P]
    I have a 6DOF Simulink model of an autonomous underwater vehicle that has states [u v w p q r x y z phi theta psi] and two inputs [theta1 theta2] that govern the angle of the control surfaces. Ocean current and depth are taken into account. How feasible would it be to use RL to reach waypoints at various [x, y, z] positions? I don’t want to use a PID controller or anything, not even RL to tune a controller. The agent would choose the theta inputs directly. I have a feeling hyperparameter tuning might play a larger role in this. I expect training times to increase exponentially as well. I have done this using a single randomly spawned waypoint with a simple unicycle kinematic model, in both Simulink/MATLAB and Python with a vectorized/parallel environment using SB3/PettingZoo/Gym. submitted by /u/VisionZUS [link] [comments]  ( 9 min )
    [R] BitNet: Scaling 1-bit Transformers for Large Language Models
    Arxiv link – BitNet: Scaling 1-bit Transformers for Large Language Models In this work, we introduce BitNet, a scalable and stable 1-bit Transformer architecture designed for large language models. Specifically, we introduce BitLinear as a drop-in replacement of the nn.Linear layer in order to train 1-bit weights from scratch. Experimental results on language modeling show that BitNet achieves competitive performance while substantially reducing memory footprint and energy consumption, compared to state-of-the-art 8-bit quantization methods and FP16 Transformer baselines. Furthermore, BitNet exhibits a scaling law akin to full-precision Transformers, suggesting its potential for effective scaling to even larger language models while maintaining efficiency and performance benefits. submitted by /u/PantsuWitch [link] [comments]  ( 9 min )
    [P] Achieving peak performance on GPU
    Hi r/MachineLearning! I recently went into the CUDA programming rabbit hole. In the process, I came across matrix multiplication and was amazed by how complicated the algorithm is in CUDA (especially if you want to get the best performance). I found the learning process quite gruelling (the CUDA docs were very average), so I wrote a tiny blog which hopefully helps anyone in the same position. You can read the blog on Medium (no paywall) or HackMD. It would probably be quite useful if you want to get a deeper intuition of how things like OpenAI Triton or FlashAttention work under the hood. Accompanying this is an implementation of a 3-hidden-layer MLP trained on MNIST in pure CUDA. Benchmarked against PyTorch, it gets up to 6x higher end-to-end training speed for small (h=128) networks, and is asymptotically 20% faster for large (h=8192) ones! https://preview.redd.it/txx2txbvlzub1.png?width=2400&format=png&auto=webp&s=7bb136b9fb535bc58fd7ee809bbbca6f68dc8953 It's worth noting that I tried reasonably hard to optimise the PyTorch implementation by using full fp16, torch.compile with fullgraph=True, mode="max-autotune", and pre-loading all data to GPU up-front (I also did this for the CUDA implementation). The main takeaways I got are: (1) For small networks, PyTorch/Python still incurs a significant overhead, even if you try pretty hard to optimise it. (2) For large networks, most of the speedup comes from using fp16 accumulation for matrix multiplication (instead of PyTorch's fp32). This obviously reduces stability, but at least in my case, I didn't observe any numerical issues. In cases where we can get away with fp16, we might be leaving a significant amount of performance on the table! (3) Anecdotally, you have to try really hard in CUDA to even get close to the performance of PyTorch, but it is possible to beat it if you try hard (suffer) enough. You can check out the repo here: https://github.com/andylolu2/cuda-mnist. Would love to hear some feedback! submitted by /u/bjergerk1ng [link] [comments]  ( 10 min )
  • Open

    Measurement-induced entanglement phase transitions in a quantum circuit
    Posted by Jesse Hoke, Student Researcher, and Pedram Roushan, Senior Research Scientist, Quantum AI Team Quantum mechanics allows many phenomena that are classically impossible: a quantum particle can exist in a superposition of two states simultaneously or be entangled with another particle, such that anything you do to one seems to instantaneously also affect the other, regardless of the space between them. But perhaps no aspect of quantum theory is as striking as the act of measurement. In classical mechanics, a measurement need not affect the system being studied. But a measurement on a quantum system can profoundly influence its behavior. For example, when a quantum bit of information, called a qubit, that is in a superposition of both “0” and “1” is measured, its state will sudde…  ( 94 min )
  • Open

    Institute Professor Daron Acemoglu Wins A.SK Social Science Award
    The award honors research on public policy with a focus on economic and governmental reforms.  ( 7 min )
  • Open

    Optimize pet profiles for Purina’s Petfinder application using Amazon Rekognition Custom Labels and AWS Step Functions
    Purina US, a subsidiary of Nestlé, has a long history of enabling people to more easily adopt pets through Petfinder, a digital marketplace of over 11,000 animal shelters and rescue groups across the US, Canada, and Mexico. As the leading pet adoption platform, Petfinder has helped millions of pets find their forever homes. Purina consistently […]  ( 9 min )
  • Open

    Understanding the user: How the Enterprise System Usability Scale aligns with user reality
    This position research paper was presented at the 26th ACM Conference on Computer-Supported Cooperative Work and Social Computing (CSCW 2023), a premier venue for research on the design and use of technologies that affect groups, organizations, and communities. In the business world, measuring success is as critical as selecting the right […] The post Understanding the user: How the Enterprise System Usability Scale aligns with user reality appeared first on Microsoft Research.  ( 10 min )
  • Open

    NVIDIA Expands Robotics Platform to Meet the Rise of Generative AI
    Powerful generative AI models and cloud-native APIs and microservices are coming to the edge. Generative AI is bringing the power of transformer models and large language models to virtually every industry. That reach now includes areas that touch edge, robotics and logistics systems: defect detection, real-time asset tracking, autonomous planning and navigation, human-robot interactions and […]  ( 8 min )
    Making Machines Mindful: NYU Professor Talks Responsible AI
    Artificial intelligence is now a household term. Responsible AI is hot on its heels. Julia Stoyanovich, associate professor of computer science and engineering at NYU and director of the university’s Center for Responsible AI, wants to make the terms “AI” and “responsible AI” synonymous. In the latest episode of the NVIDIA AI Podcast, host Noah […]  ( 6 min )
    Into the Omniverse: Marmoset Brings Breakthroughs in Rendering, Extends OpenUSD Support to Enhance 3D Art Production
    Real-time rendering, animation and texture baking are essential workflows for 3D art production. Using the Marmoset Toolbag software, 3D artists can enhance their creative workflows and build complex 3D models without disruptions to productivity.  ( 7 min )
    Foxconn and NVIDIA Amp Up Electric Vehicle Innovation
    NVIDIA founder and CEO Jensen Huang joined Hon Hai (Foxconn) Chairman and CEO Young Liu to unveil the latest in their ongoing partnership to develop the next wave of intelligent electric vehicle (EV) platforms for the global automotive market. This latest move, announced today at the fourth annual Hon Hai Tech Day in Taiwan, will […]  ( 6 min )
  • Open

    Portable sed -i across MacOS and Linux
    The -i flag to ask sed to edit a file in place works differently on Linux and MacOS. If you want to create a backup of your file before you edit it, say with the extension .bak, then on Linux you would run sed -i.bak myfile but for the version of sed that ships with […] Portable sed -i across MacOS and Linux first appeared on John D. Cook.  ( 6 min )
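    The teaser above is cut off, so the blog's own fix is not shown here. One way to sidestep the GNU/BSD `sed -i` difference altogether (a swapped-in approach, not the post's) is to do the in-place edit with Python's standard-library `fileinput` module, which keeps a backup copy when given `backup=".bak"`. The function name `replace_in_place` and its parameters are hypothetical.

```python
import fileinput
import re

def replace_in_place(path: str, pattern: str, repl: str, backup_ext: str = ".bak") -> None:
    """Regex-substitute every line of `path`, keeping the original as path + backup_ext."""
    with fileinput.input(files=[path], inplace=True, backup=backup_ext) as lines:
        for line in lines:
            # With inplace=True, standard output is redirected into the file being rewritten.
            print(re.sub(pattern, repl, line), end="")

# Example call (hypothetical file name and substitution):
# replace_in_place("myfile", r"foo", "bar")
```

    The same script behaves identically on macOS and Linux, at the cost of requiring a Python interpreter instead of plain sed.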
  • Open

    Best Books to Learn Neural Networks in 2023 for Beginners (Updated) -
    submitted by /u/Lakshmireddys [link] [comments]

  • Open

    Roughly how much time will a task running on a RTX 3060 take VS a ~i7 CPU? [Discussion]
    Anyone have examples of tasks run between the two? Doesn't need to be exact. submitted by /u/Apita2000 [link] [comments]  ( 8 min )
    [D] Feedback on my MVP project - Pre-Recorded Standardized Video Interviews Job Site for Data Professionals
    Hey! Startup: - Apply Script dot com "Connect business and data professionals via pre-recorded standardized video interviews." More details: Problems with Traditional Hiring - Outdated: The current method of conducting interviews has become overly complex and outdated. - Time-Wasting: The process involves too many appointments, meetings, and stages, leading to communication errors. - Expensive: The man-hours invested by HR and engineering teams are costly. - Constraining: Interviews are fixed to specific times and locations. - Cumbersome: The experience is challenging for both businesses and professionals. Our Solution + Talent Identification: We find top talent that matches your job post. + Standardized Interviews: Professionals standardized pre-record their …  ( 9 min )
    [D] Help identifying research papers for online / cyclic / sequential learning?
    So my situation is that I have a pretrained model and we get a new update of data every month (note: this monthly data is very small compared to the original dataset, the original dataset was about 5 years worth, or ~60x the size of any given monthly update), how can I update my pretrained model on the much smaller set of new data, learning from the data without overfitting to that data? Or frankly, what would be better if it is possible, would be to extend my pretrained model such that it learns from the new data and then can be more tightly fit to that month's data. So something like meta-learning or local fine-tuning, but I want to continue to update and improve my pretrained model so that I have a base model that can do well on each month's new data. Does anyone know anything like this, or have advice on terms to look into, beyond just transfer learning or regularization? submitted by /u/Amun-Aion [link] [comments]  ( 9 min )
    How to properly implement Cover's Theorem in an SVM? [P]
    Maybe this belongs elsewhere since it's probably a dumb basic question, but basically I'm taking an undergrad course in AI and we've been given a classification problem. We were told as a "hint" to recall Cover's Theorem when separation fails, but the issue is she also wants us to draw a rough sketch of the data with the separator. Mine failed in a basic scatterplot so I upped the dimension by 1 but it also wasn't separable in R3 (which is annoying to draw anyway but could have been done), if I keep going then it might work at some point but idk how I'm meant to draw the data if it's separated in R4 or beyond. If it works in R4 do I just sketch the data in R3 and just draw a 3 dimensional point where w = 0? But even then if it goes beyond R4 it becomes way more annoying. So I'm assuming my implementation is just wrong, maybe the formula I used was wrong. Can someone show what a proper implementation looks like and how we're meant to up dimensions? Don't wanna post what I tried bc it has starter code and stuff baked into it which might allow my professor to find this post 😂 submitted by /u/Traditional_Land3933 [link] [comments]  ( 9 min )
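    For readers with the same question, here is a minimal sketch of the idea behind Cover's Theorem, using assumptions of my own rather than the course's starter code: XOR-style toy data that no line separates in R2 becomes linearly separable after adding one nonlinear feature, x1*x2. The functions `lift` and `train_perceptron` and the toy data are hypothetical.

```python
def lift(point):
    """Map (x1, x2) to (x1, x2, x1*x2): one extra nonlinear feature."""
    x1, x2 = point
    return (x1, x2, x1 * x2)

def train_perceptron(samples, labels, epochs=100):
    """Classic perceptron; returns (weights, bias), or None if no separator was found."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in zip(samples, labels):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:  # misclassified or on the boundary: update
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                mistakes += 1
        if mistakes == 0:
            return w, b
    return None

# XOR-style toy data: opposite corners share a label.
points = [(-1, -1), (-1, 1), (1, -1), (1, 1)]
labels = [1, -1, -1, 1]

flat = train_perceptron(points, labels)                       # no line exists in R2
lifted = train_perceptron([lift(p) for p in points], labels)  # a plane exists in R3
print(flat, lifted)
```

    On the drawing question: the separating plane in the lifted space corresponds to a curve in the original plane (here, the set where the learned function of x1, x2 and x1*x2 equals zero), so one option is to sketch the data in its original dimensions with that curved boundary rather than trying to draw R4 or beyond.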
    [D] Cross Entropy Classification vs Metric Learning + k-NN for image classification?
    Hi guys. We've all seen how hot RAG and vector DBs have been lately. How good are retrieval-based approaches for image classification? More concretely: Suppose we have a network trained with metric learning and a massive, diverse set of labelled examples to retrieve from. We've just been tasked to do classification with a fixed number of classes, and we've narrowed it down to two options: (1) Embed our dataset using our metric learning network, throw the embeddings into a vector DB, and do k-NN. (2) Train a classifier via cross-entropy loss. Which approach would we expect to provide better performance? What are the trade-offs? Any insight is appreciated! submitted by /u/supersmartypants [link] [comments]  ( 9 min )
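    The retrieval option in the post can be sketched in a few lines. This is a toy illustration only: the 2-D "embeddings", the labels, and the function names are invented, and a real system would use the metric-learning network's outputs and a vector DB's approximate search instead of a full sort.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def knn_predict(query, index, k=3):
    """index: list of (embedding, label). Majority vote among the k most similar entries."""
    nearest = sorted(index, key=lambda item: cosine(query, item[0]), reverse=True)[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical labelled embeddings standing in for the vector DB contents.
index = [
    ((1.0, 0.1), "cat"), ((0.9, 0.2), "cat"), ((0.8, 0.0), "cat"),
    ((0.1, 1.0), "dog"), ((0.2, 0.9), "dog"), ((0.0, 0.7), "dog"),
]
prediction = knn_predict((0.95, 0.05), index, k=3)
print(prediction)
```

    One trade-off this makes visible: with k-NN, adding a class or new examples only means appending to `index`, whereas a cross-entropy classifier has a fixed output layer and would typically need retraining.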
    [D] Graph Neural Networks - Links Prediction Task on Directed, Heterogenous Multigraphs
    Hi guys, I have the following use case at hand for my thesis, and I'd like to ask for some help to formulate my problem: A directed multigraph (1 node type, multiple edge types). Each node and edge have their own attributes. A set of graphs that are fully labeled. The dataset is self-created according to some technical rules. Training is supposed to be done on this dataset. My task is to perform link prediction in the inductive setting. This means that given an unseen incomplete graph at the inference time, the model should be able to predict all the missing links. I have read many papers and tried to formulate my problem in many directions. Since I am also new to GNNs, I would prioritize papers with an existing codebase and sound theoretical justifications for the techniques (which …  ( 10 min )
    Trouble improving accuracy in face recognition dataset [P]
    Hey everyone, I'm trying my hand at the Labeled Faces in the Wild dataset for a face recognition task. I have made a Siamese model, and my loss curve is looking great, but my accuracy stays at 0.500 for everything I have tried. Is there anybody in here who has tried their hand at this task before and can give me some tips to improve my accuracy? I am implementing it in Python with PyTorch, btw. Thanks in advance! submitted by /u/Due_Concentrate1279 [link] [comments]  ( 9 min )
    How valuable is a PhD in science (with applied ML) compared to a PhD in only Machine learning [D]
    Is it more advantageous to pursue a PhD in machine learning with a focus on scientific applications (for example, machine learning for drug design) if the end goal is to work in the machine learning industry? Or is a general PhD in machine learning more valuable for this career path? Thank you submitted by /u/Neat-Print2792 [link] [comments]  ( 9 min )
    [R] 85% of the variance in language model performance is explained by a single factor (g, a unified measure of LLM ability)
    TL;DR and paper link are at the bottom of the post. I'm an undergrad who just wrote my first paper completely solo. Crazy experience with so many highs and lows, but I learned a lot from it. I think the results are important and I want people to see them, so I'll try to walk through the paper here as best as I can. I also have a small request for Arxiv enjoyers at the end. Given the nature of Reddit posts, I'll focus a bit less on the methods and more on the results. I won't cite stuff here either, but obviously you can find citations in the paper. First I'll give a small bit of historical context to what I'm doing, then walk through what I did and what came of it. Enjoy the read. The general intelligence factor in humans In the early 1900s, Charles Spearman observed that children's …  ( 14 min )
    [D] How to design API of Machine learning library
    In the past nine years of my deep learning journey, I have come across a vast number of frameworks. Lua Torch was a fantastic framework that initially died due to a lack of Python's ecosystem, but then rose again as PyTorch. Theano was also a great framework, but its major drawback was difficult debugging. I remember spending two weeks writing a Neural Turing Machine for solving bAbI tasks in Theano. (Nowadays, it would take a couple of hours in PyTorch.) TensorFlow - I still don't understand what that was, a terrible framework. There was also Caffe, which was popular in computer vision. Julia is another language that attempted to introduce automatic differentiation as a built-in feature. And JAX, which I was originally biased against since it's a Google product. But some close friends persuaded me to try it, and I actually liked it. However, I thought that it would be difficult for JAX to gain widespread adoption in the community, as PyTorch already had a strong network effect and was gaining traction quickly. I didn't see how anyone could catch up with PyTorch. Another issue with JAX is that it requires additional cognitive load for developers. Take a look: https://higgsfield.substack.com/p/how-to-design-api-of-machine-learning submitted by /u/Good-Willingness-985 [link] [comments]  ( 9 min )
    [P] 2D Gaussian Splatting a great starting point for people who want to delve deeper
    Github : https://github.com/OutofAi/2D-Gaussian-Splatting https://i.redd.it/cwgsjtko1sub1.gif submitted by /u/TerryCrewsHasacrew [link] [comments]  ( 8 min )
    [D] How to Build Data Products? Deploy: Part 3/4 - Doubling down on the power of Unified Experiences for building state of the art models.
    Data products play an important role in building state of the art machine learning models. Though the process of building them seems a bit confusing within the industry as of now, this article series tries to simplify it by breaking it down into 4 steps. Take a look: https://moderndata101.substack.com/p/how-to-build-data-products-deploy What processes are being followed at your org for building scalable data products? submitted by /u/growth_man [link] [comments]  ( 9 min )
    Shared Public Contextual Database for RAG [D]
    Hey Guys, It seems RAG is really taking off as an increasingly popular use case for LLMs to leverage contextual data. However, everybody is building their own contextual data sets and embedding them in their own silo'd vector dbs. Do you guys think there's any utility in having a shared public vector db that anyone can tap into via its API, without having to self-host or worry about the embedding pipelines and filling the vector db with enough data in the first place for their use cases? Would this save devs a lot of time in quickly testing product ideas? (albeit it does seem that proprietary data is what everyone's raving about today) - For context, I'm building a social media product where users can upload a few pieces (approx 10) of content (social media posts, websites, videos to start with), which becomes the verified human-curated list/Niche. We then classify and embed this into a vector db. From this, we have set up a data pipeline to scrape the web and find new content that is most similar, which we suggest to users to add to the Niche (upvote, downvote style). When a piece of content is upvoted, it's added to the verified list, updating the Niche's classification string. Essentially we're aiming to construct an ever-growing, user-curated, contextually classified vector database from a relatively small set of sample data. submitted by /u/niksteel123 [link] [comments]  ( 9 min )
    [D] Work regarding using LLMs to generate data for downstream tasks.
    Hi. I'm curious if there have been any studies done regarding the effects of using data generated by LLMs for other downstream tasks. The closest that I could find are the two papers: Large Language Model as Attributed Training Data Generator: A Tale of Diversity and Bias (Yu et al., 2023) Generating Training Data with Language Models: Towards Zero-Shot Language Understanding (Meng et al., 2022) The former focuses on studying the differences between the type of prompts that are used to generate the data and the latter doesn't use LLMs. Doesn't have to be papers, blog posts or any sort of information regarding the scenario I described is fine. Thanks. submitted by /u/Seankala [link] [comments]  ( 9 min )
    [D] Embedding models in production(CPU w/ high throughput)
    Hello, I am working on an app that requires creating lots of text embeddings (100M tokens). Looking at OpenAI Ada pricing (and considering that my app doesn't yet make any money) I'm looking into self-hosting a model to run on CPU. I know that constrains me towards smaller models; so far locally I've been testing with sentence-transformers/all-MiniLM-L6-v2 and the query results seem okay-ish enough for my MVP. (Although, I should note that I haven't compared how embeddings with other models would perform.) Does anyone have experience doing something similar? In particular, I'd love to hear about any tips you have for maximizing no. of embeddings / second. (new to ML/MLOps, so apologies if this is a silly question :) submitted by /u/rsamrat [link] [comments]  ( 9 min )
    [N] Introducing Stable Fast: An ultra lightweight inference optimization library for HuggingFace Diffusers on NVIDIA GPUs
    What is this? stable-fast is an ultra lightweight inference optimization library for HuggingFace Diffusers on NVIDIA GPUs. stable-fast provides super fast inference optimization by utilizing some key techniques and features: CUDNN Convolution Fusion: stable-fast implements a series of fully-functional and fully-compatible CUDNN convolution fusion operators for all kinds of combinations of Conv + Bias + Add + Act computation patterns. Low Precision & Fused GEMM: stable-fast implements a series of fused GEMM operators that compute with fp16 precision, which is faster than PyTorch's defaults (read & write with fp16 while computing with fp32). NHWC & Fused GroupNorm: stable-fast implements a highly optimized fused NHWC GroupNorm + GELU operator with OpenAI's Triton, which eliminates the need…  ( 9 min )
    [R] Does the Flan T5 decoder take the question as input ?
    Hello, I was looking at the Flan T5 paper and code. It was clear that the question (instruction) and the context are given to the encoder as input. But I find no details on what does the decoder take as input apart from the fact that it starts with the pad token. Anyone can give me more details please ? Thanks ! submitted by /u/Meddhouib10 [link] [comments]  ( 9 min )
    [D] TensorFlow.js and state of the ecosystem for JavaScript
    I am curious about the state of the ecosystem for JavaScript, where TF looks like a reasonably solid option. Options I have found so far are: TensorFlow.js (looks like the most complete solution, but the general sentiment about TF in Python is pretty bad!); MediaPipe (to quickly implement specific use cases it seems, maybe using tf.js in the background?); ml5.js (a layer on top of tf.js to make it more approachable if I understand correctly); transformers.js (haven't quite grasped this one); shumai (bun only, so server side only). I am curious to read informed opinions about these and more! Have you used them and how? submitted by /u/gtnbssn [link] [comments]  ( 9 min )
    [D] Which raw OPS/s benchmarks best reflect ML/DL workloads?
    Hi all. I'm preparing an open website about products used for AI/ML/DL computation (no in-house testing for now, just the database and GUI). However, comparing the raw speed of products from different vendors is more challenging than I anticipated, because there are many possible raw performance indicators, only a few of which are provided by vendors. For example a raw performance indicator can be "FP32 vector with opportunistic optimization", while another can be "BF16 matrix/tensor without opportunistic optimization". A full picture of raw performance would be fully represented only by a table with multiple dimensions: Number format (FP64, FP32, TF32, FP16, BF16, FP8, INT8, INT4... are there others?) Vector vs. matrix/tensor operation (boolean) Opportunistic optimizations like Nvidia Sparsity…  ( 9 min )
    [D] Interesting loss graphs
    Wondering if anyone has some interesting loss graphs that they could share. Maybe loss suddenly dropped after 100 epochs, or a local minimum was found and then it jumped into a lower one. Wondering if anyone forgot to turn off training and came back to a better result than the one they thought training had already converged to. submitted by /u/HStuart18 [link] [comments]
  • Open

    Thoughts on new ChatGPT features
    I've had access to the DALL-E 3, Vision and voice chat features, and I've been blown away by how impressive each of the new features is. DALL-E 3 seems roughly comparable to Midjourney in overall image quality, but does a much better job at understanding the prompt. The vision model continues to surprise by how well it is able to understand images at a seemingly human level of comprehension. And the voice chat is such an intuitive and captivating way of interacting with ChatGPT, it felt like I was interacting with one of the AI assistants from the movie "Her". However, it's unfortunate that these amazing new features cannot be used together at the same time. Up until gaining access to these features, I had been using the advanced data analysis model as my default, which is great for helping with programming tasks. I can only imagine how revolutionary ChatGPT will be when a cohesive multi-modal model is released sometime in the near future which has all these capabilities available from the start. What things would you want to try if such a cohesive model was released? I can already imagine some use cases where you could set up iterative improvement for things like interface design, which some people have already got to work with just the base vision model by itself. submitted by /u/ImRealNow [link] [comments]
    U.S. Tightens China's Access to Advanced Chips for Artificial Intelligence
    The Biden administration has announced additional limits on sales of advanced semiconductors by American firms to China, in an effort to restrict China's progress on supercomputing and artificial intelligence. The new rules will likely halt most shipments of advanced semiconductors from the United States to Chinese data centers, which use them to produce models capable of artificial intelligence. Chip makers seeking to sell China advanced chips or the machinery used to make them will be required to notify the government of their plans or obtain a special license. To prevent the risk of advanced U.S. chips reaching China through third countries, chip makers will also need licenses to ship to other countries subject to U.S. arms embargoes. The Biden administration argues that China's access to advanced technology is dangerous as it could aid the country's military in tasks like guiding hypersonic missiles or cracking top-secret U.S. codes. The restrictions may affect Chinese companies developing AI chatbots and could weaken China's economy in the long run, as AI is transforming industries from retail to healthcare. The limits are also expected to impact sales to China of U.S. chip makers such as Nvidia, AMD, and Intel, who earn a significant portion of their revenue from Chinese buyers. The rules will exempt chips used in commercial applications like smartphones, laptops, electric vehicles, and gaming systems. The Semiconductor Industry Association, which represents major chip makers, is evaluating the impact of the updated rules. The Biden administration has been trying to counter China's growing mastery of cutting-edge technologies by investing in new chip factories in the U.S. while setting restrictions on exports of technology to China. Source : https://www.nytimes.com/2023/10/17/business/economy/ai-chips-china-restrictions.html submitted by /u/NuseAI [link] [comments]
    Google: Data-scraping lawsuit would take 'sledgehammer' to generative AI
    Google has asked a California federal court to dismiss a proposed class action lawsuit that claims the company's scraping of data to train generative artificial-intelligence systems violates millions of people's privacy and property rights. Google argues that the use of public data is necessary to train systems like its chatbot Bard and that the lawsuit would 'take a sledgehammer not just to Google's services but to the very idea of generative AI.' The lawsuit is one of several recent complaints over tech companies' alleged misuse of content without permission for AI training. Google general counsel Halimah DeLaine Prado said in a statement that the lawsuit was 'baseless' and that U.S. law 'supports using public information to create new beneficial uses.' Google also said its alleged use of J.L.'s book was protected by the fair use doctrine of copyright law. Source : https://www.reuters.com/legal/litigation/google-says-data-scraping-lawsuit-would-take-sledgehammer-generative-ai-2023-10-17/ submitted by /u/NuseAI [link] [comments]
    [AI Dad Joke] Why did the AI stop being nice?
    It regressed to mean... PS: I read the sidebar which didn't exclude humor, and the flair seems to suggest that it would be okay, but my apologies if not. submitted by /u/Tyler_Zoro [link] [comments]
    👨🏻‍🏫 Generative AI Security Standards, LLM's 200K Context Window, Alibaba's Open-Source Obsession, and Baidu World 2023
    submitted by /u/trcytony [link] [comments]
    Can GPT models be financial analysts? ChatGPT, GPT-4 fail CFA exams in new study by JP Morgan, Queens University, and Virginia Tech
    Researchers evaluated ChatGPT and GPT-4 on mock CFA exam questions to see if they could pass the real tests. The CFA exams rigorously test practical finance knowledge and are known for being quite difficult. They tested the models in zero-shot, few-shot, and chain-of-thought prompting settings on mock Level I and Level II exams. The key findings: GPT-4 consistently beat ChatGPT, but both models struggled way more on the more advanced Level II questions. Few-shot prompting helped ChatGPT slightly. Chain-of-thought prompting exposed knowledge gaps rather than helping much. Based on estimated passing scores, only GPT-4 with few-shot prompting could potentially pass the exams. The models definitely aren't ready to become charterholders yet. Their difficulties with tricky questions and core finance concepts highlight the need for more specialized training and knowledge. But GPT-4 did better overall, and few-shot prompting shows their ability to improve. So with targeted practice on finance formulas and reasoning, we could maybe see step-wise improvements. TLDR: Tested on mock CFA exams, ChatGPT and GPT-4 struggle with the complex finance concepts and fail. With few-shot prompting, GPT-4 performance reaches the boundary between passing and failing but doesn't clearly pass. Full summary here. Paper is here. submitted by /u/Successful-Western27 [link] [comments]
    Let's find out what GPT4 vision can do
    GPT4 vision isn't just a gimmick. We've been given a new superpower, and so we must "deal with it". This is probably as big a moment as when ChatGPT first arrived, maybe more. Machine Vision for the masses (and more). I tried doing some very loose sketches, and it really struggled to identify them until they were coloured in. Humans could easily tell what they were. But, in order to see what uses it has, we need to know what capabilities it does and does not have. Pick a question and see what you can learn! can it use TINY images (I assume they are much faster) can it tell you what has changed in two images? can it measure distances ? (with perspective?) can it make 3d models from instructions? can it "learn" to recognise people/ similar objects (in the same context window) what limits are there to exhaustive listing exhaustive description is it better at details or overviews can it read maps / graphs / text how smart is it on DIY / xrays / mechanics can it follow wires?? (Can it find lego) is there a formal reference system you can use (X/Y) can it give co-ordinates in large grids or grid-like (how un-grid like) ie film strip, or window-panes can it navigate a 2d maze turn-by turn? 3d maze? can that be insanely complex? can it make ebay descriptions (condition) can it estimate food weight can it estimate strength / angles / volume can it create programs from screenshots. Can it use programs? games? control RC car / robot? what kind of language / instructions are best when talking about images. what other questions do we need submitted by /u/inteblio [link] [comments]
    AI pioneers LeCun, Bengio clash in intense online AI safety, governance debate
    Yann LeCun and Yoshua Bengio, two influential figures in AI and deep learning, engaged in a heated debate over the potential risks and safety concerns surrounding AI. LeCun emphasized the need to design AI systems for safety rather than imagining catastrophic scenarios. Bengio argued for the importance of prudence, stating that we still do not understand how to design safe, powerful AI systems, and highlighted the need for major investment in AI safety and governance. The debate highlighted the disagreement among esteemed researchers about AI's potential risks, the effectiveness of current safety measures, and the best path forward. The implications of AI, including job displacement, privacy violations, and existential risks, have become a topic of widespread concern. Source : https://venturebeat.com/ai/ai-pioneers-yann-lecun-and-yoshua-bengio-clash-in-an-intense-online-debate-over-ai-safety-and-governance/ submitted by /u/NuseAI [link] [comments]
  • Open

    Learn how Amazon Pharmacy created their LLM-based chat-bot using Amazon SageMaker
    Amazon Pharmacy is a full-service pharmacy on Amazon.com that offers transparent pricing, clinical and customer support, and free delivery right to your door. Customer care agents play a crucial role in quickly and accurately retrieving information related to pharmacy information, including prescription clarifications and transfer status, order and dispensing details, and patient profile information, in […]  ( 8 min )
    Keeping an eye on your cattle using AI technology
    At Amazon Web Services (AWS), not only are we passionate about providing customers with a variety of comprehensive technical solutions, but we’re also keen on deeply understanding our customers’ business processes. We adopt a third-party perspective and objective judgment to help customers sort out their value propositions, collect pain points, propose appropriate solutions, and create […]  ( 16 min )
    Personalize your search results with Amazon Personalize and Amazon OpenSearch Service integration
    Amazon Personalize has launched a new integration with Amazon OpenSearch Service that enables you to personalize search results for each user and assists in predicting their search needs. The Amazon Personalize Search Ranking plugin within OpenSearch Service allows you to improve the end-user engagement and conversion from your website and app search by taking advantage […]  ( 7 min )
  • Open

    DSC Weekly 17 October 2023
    Announcements Top Stories In-Depth The post DSC Weekly 17 October 2023 appeared first on Data Science Central.  ( 20 min )
    Uncharted digital landscapes and the quest for timeless identity
    In a recent podcast episode, Lex Fridman and Mark Zuckerberg convened in the Metaverse, where the digital realm intertwines with reality. Their astonishingly realistic interaction, while highlighting technological advancements, also prompted deeper contemplations. As the line between digital recreations and reality becomes increasingly blurred, it beckons questions about the definitions of identity and consciousness and… The post Uncharted digital landscapes and the quest for timeless identity appeared first on Data Science Central.  ( 22 min )
    Internet Of Things (IOT):  Application In Hazardous Locations
    Introduction to Internet of Things (IOT): Internet of Things (IoT) represents the fourth-generation technology that facilitates the connection and transformation of products into smart, intelligent and communicative entities. IoT has already established its footprint in various business verticals such as medical, health care, automobile, and industrial applications. IoT empowers the collection, analysis, and transmission of… The post Internet Of Things (IOT):  Application In Hazardous Locations appeared first on Data Science Central.  ( 23 min )
    The digital evolution in aviation: how big data and analytics are transforming the industry
    Long before passengers sit back, relax, and enjoy their flight, data has played a critical role in getting them to their seats. It has been a cornerstone of the aviation industry since the early days of air travel. Indeed, from the early 20th century, data was collected through manual processes such as pilots logging information… The post The digital evolution in aviation: how big data and analytics are transforming the industry appeared first on Data Science Central.  ( 20 min )
  • Open

    "STARC: A General Framework For Quantifying Differences Between Reward Functions", Skalse et al 2023
    submitted by /u/gwern [link] [comments]
    "Goodhart's Law in Reinforcement Learning", Karwoski et al 2023
    submitted by /u/gwern [link] [comments]
    Dynamic state and action space
    Hello, I'm working on a scenario that involves many systems, each of which contains many subsystems. At each decision time, depending on which system requests the decision, the RL agent must select a subsystem. However, each system has a different number of subsystems, which makes the action space and the state space dynamic, since each neuron in the output layer represents a subsystem. Can I use the maximum number of subsystems in any one system (not the total number) as the number of outputs and mask some neurons according to the current system? submitted by /u/GuavaAgreeable208 [link] [comments]
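    The masking idea raised in this question can be sketched as follows. This is a minimal, framework-free illustration rather than a full agent: the network is assumed to emit one logit per possible subsystem, sized to the largest system, and `masked_softmax`, `valid_count`, and the sample logits are hypothetical names and values chosen for the example.

```python
import math


def masked_softmax(logits, valid_count):
    """Softmax over the first `valid_count` logits; padded actions get probability 0."""
    if not 0 < valid_count <= len(logits):
        raise ValueError("valid_count must be between 1 and len(logits)")
    # Padded (non-existent) subsystems are pushed to -inf so exp() maps them to 0.
    masked = [x if i < valid_count else -math.inf for i, x in enumerate(logits)]
    peak = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in masked]
    total = sum(exps)
    return [e / total for e in exps]


# Output layer sized to the largest system (5 subsystems, an assumed value);
# the system making this request only has 3, so the last two are masked.
logits = [0.2, 1.5, -0.3, 0.9, 2.0]
probs = masked_softmax(logits, valid_count=3)
print(probs)
```

    Masked actions receive probability zero, so they are never sampled. A padded state vector would need a matching convention for the unused subsystem slots.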
    Offline rl- interpreting policy
    I am new to RL and have a naive question. How interpretable would the policy be if I trained an RL algorithm in an offline setting? Could I make inferences about what the optimal sequences would be? submitted by /u/kwsunshine123 [link] [comments]
  • Open

    Goal Representations for Instruction Following
    A longstanding goal of the field of robot learning has been to create generalist agents that can perform tasks for humans. Natural language has the potential to be an easy-to-use interface for humans to specify arbitrary tasks, but it is difficult to train robots to follow language instructions. Approaches like language-conditioned behavioral cloning (LCBC) train policies to directly imitate expert actions conditioned on language, but require humans to annotate all training trajectories and generalize poorly across scenes and behaviors. Meanwhile, recent goal-conditioned approaches perform much better at general manipulation tasks, but do not enable easy t…  ( 7 min )
  • Open

    Striking Performance: Large Language Models up to 4x Faster on RTX With TensorRT-LLM for Windows
    GeForce RTX and NVIDIA RTX GPUs, which are packed with dedicated AI processors called Tensor Cores, are bringing the power of generative AI natively to more than 100 million Windows PCs and workstations.  ( 7 min )
    NVIDIA RTX Video Super Resolution Update Enhances Video Quality, Detail Preservation and Expands to GeForce RTX 20 Series GPUs
    NVIDIA today announced an update to RTX Video Super Resolution (VSR) that delivers greater overall graphical fidelity with preserved details, upscaling for native videos and support for GeForce RTX 20 Series GPUs.  ( 7 min )
  • Open

    New technique helps robots pack objects into a tight space
    Researchers coaxed a family of generative AI models to work together to solve multistep robot manipulation problems.  ( 11 min )
  • Open

    Model metamers reveal divergent invariances between biological and artificial neural networks
    submitted by /u/Chipdoc [link] [comments]  ( 8 min )

  • Open

    Article: Key Concepts and Open Questions in a Golden Age for Natural Language Understanding
    submitted by /u/Stanford_Online [link] [comments]
    DexCatch: Learning to Catch Arbitrary Objects with Dexterous Hands
    🌟 Excited to share our recent research, DexCatch! Pick-and-place is slow and boring, while throw-catching is a behaviour towards more human-like manipulation. We propose a new model-free framework that can catch diverse objects of daily life with dexterous hands in the air. This ability to catch anything from a cup to a banana, and a pen, can help the hand quickly manipulate objects without transporting objects to their destination -- and even generalize to unseen objects. Video demonstrations of learned behaviors and the code can be found at https://dexcatch.github.io/. ​ https://reddit.com/link/17973ri/video/i4xdo39d4lub1/player submitted by /u/Shengjie_Wang [link] [comments]
    Help with Model Based Policy Optimization
    I am reading this paper and came across the following paragraph: "Model usage. Many recent model-based algorithms have focused on the setting in which model rollouts begin from the initial state distribution (Kurutach et al., 2018; Clavera et al., 2018). While this may be a more faithful interpretation of Algorithm 1, as it is optimizing a policy purely under the state distribution of the model, this approach entangles the model rollout length with the task horizon. Because compounding model errors make extended rollouts difficult, these works evaluate on truncated versions of benchmarks. The branching strategy described in Section 4.2, in which model rollouts begin from the state distribution of a different policy under the true environment dynamics, effectively relieves this limitation. In practice, branching replaces few long rollouts from the initial state distribution with many short rollouts starting from replay buffer states." What does state distribution mean here? Also, in line 8 of the image, I don't understand the relation between the model rollout and the policy \pi_t. Is it saying to use the model-free algorithm to take future steps from that state? What does the model have to do with that? https://preview.redd.it/twlej5my3kub1.png?width=1182&format=png&auto=webp&s=4a515c8d237c963052bc1b60a9e7dda53a33f001 submitted by /u/Academic-Rent7800 [link] [comments]
    math prerequisites for reinforcement learning research?
    hi all! i’m an undergraduate that is really interested in pursuing a PhD. i think reinforcement learning is especially interesting, causal reinforcement learning in particular. for my current research job, which unfortunately doesn’t really involve ML, i read a little about causal inference and it really intrigued me. what mathematics courses should i take to get into RL research at a theoretical/algorithmic level? i am currently taking proof-based linear algebra, and have taken all the computational calculus offered. i imagine prob. theory/math stats is pretty important, too; what else? submitted by /u/treeman0469 [link] [comments]
  • Open

    [D] Exploring Methods to Improve Text Chunking in RAG Models (and other things...)
    Hello everyone, I'm currently working on Retrieval Augmented Generation (RAG) models and have developed a custom chunking function, as I found the methods in LangChain not entirely satisfactory. I'm keen on exploring other methods, algorithms (related to NLP or otherwise), and models to enhance text chunking in RAG. There are many RAG implementations out there, but I've noticed a lack of focus on improving chunking performance specifically. Are there any other promising approaches beyond my current pipeline, which consists of a bi-encoder (retriever), cross-encoder (reranker), and a Large Language Model (LLM) for interactions? For queries, I'm using both traditional and HyDE (Hypothetical Document Embedding) approaches in the retrieval phase, and sending the top 'n' results of both similarity searches to the reranker. I've also tried using an LLM to convert the query into a series of 10-20 small phrases or keywords, which are then used as the query for the retriever model. However, the results vary depending on the LLM used. To generate good keywords (with a non-extractive approach), I had to use a "CoT" prompt, instructing the model to write self-instructions, a problem analysis, and its reasoning before generating the required keywords. But this approach uses a lot of tokens and requires careful parsing to ensure the model has used the right delimiter to separate the reasoning from the actual answer. I'm also planning to modify the text used to generate embeddings, while returning the original text after the recall phase. But this is still a work in progress and scaling it is proving to be a challenge. If anyone has any tips or experience with this, I'd appreciate your input. I'd be grateful for any resources, repositories, libraries, or existing implementations of novel chunking methods that you could share. Or we could just discuss ideas, thoughts, or approaches to improve text chunking for RAG here. Thanks in advance for your time! submitted by /u/BXresearch [link] [comments]  ( 9 min )
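    The delimiter check described in this post, separating the model's reasoning from the final keywords, might look like the sketch below. The delimiter string `### KEYWORDS` and the function name `extract_keywords` are assumptions made for illustration; the post does not say which delimiter or prompt format is used.

```python
def extract_keywords(completion, delimiter="### KEYWORDS"):
    """Return the keywords after the last delimiter, or None if it is missing."""
    _reasoning, found, answer = completion.rpartition(delimiter)
    if not found:
        # The model ignored the requested format; the caller can retry or fall back.
        return None
    # Accept keywords separated by commas or by newlines.
    parts = answer.replace("\n", ",").split(",")
    return [p.strip() for p in parts if p.strip()]


sample = "The query is about chunk size...\n### KEYWORDS\ntext chunking, overlap\nreranker"
keywords = extract_keywords(sample)
print(keywords)
```

    Returning `None` instead of guessing makes a malformed completion easy to detect, at the cost of one extra call when the model has to be asked again.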
    [D] Rate my GPU server for Deep Learning
    I started learning deep learning last year and decided to step up my game with regard to model training and tools. I recently built a GPU server. It’s still within its return period, so please help decide if it’s worth keeping: Processor: 2x Xeon E5-2690 v4 2.6GHz 14-Core Memory: 128GB GPU: 8x NVIDIA Tesla P100 16GB HBM2 Accelerator Card Total cost: ~$3200 submitted by /u/Stonks-Stocks [link] [comments]  ( 9 min )
    [N] "How to Apply to Grad School" webinars by CMU RI!
    We are hosting a few "How to Apply to Grad School" webinars this week. This is a chance to hear from faculty and students in the Robotics Institute at CMU on what life in grad school is actually like, as well as get some tips on crafting a strong application! https://cmu-ri-resources.github.io/ submitted by /u/bart-ai [link] [comments]  ( 9 min )
    [R] Microsoft presents Table-GPT: Table-tuned GPT for Diverse Table Tasks
    Tables pack tons of relational data but are tough for AI to grasp. They have complex 2D structure with information scattered across rows and columns. Models like GPT-3 fail basic tasks like finding where a missing value should go. LLMs struggle at this because they're pre-trained mostly on natural text, which is linear. Researchers at Microsoft wanted to mitigate this with "table-tuned" models, trained on table-related tasks. Their process: Automatically generate lots of diverse table-task training cases from a corpus of real-world tables. Ex: "impute missing value" or "identify error in table". Further augment data via paraphrasing, shuffling table rows/columns, chaining model responses, etc. This table-tuning produced "Table-GPT" models with substantially stronger table skills. In experiments, Table-GPT crushed vanilla GPT-3: 25%+ better on unseen table tasks like missing value ID and column type ID Beat GPT-3 on 98% of test cases across 9 different table tasks Stayed superior after downstream tuning too There's tons more work to do but seems pretty promising. Table-tuning boosted models' ability to comprehend tables and reason over tabular data vs just pre-training on text. TLDR: Training AI models more on synthesized table tasks ("table-tuning") significantly improves their table skills. Full summary is here. Paper is here. submitted by /u/Successful-Western27 [link] [comments]  ( 9 min )
    [D] Text-to-pose?
    When are we getting a text-to-pose ai? I'd love to be able to generate poses for 3d models that match a given text description, because sometimes what my mind comes up with doesn't feel adequate. It's frustrating that I'm not seeing any developments in this area of ai, and I lack the skills to commence the developments myself. submitted by /u/BM09 [link] [comments]  ( 9 min )
    [R] Google Pali-3 Vision Language Models: Contrastive Training Outperforms Classification
    submitted by /u/currentscurrents [link] [comments]  ( 8 min )
    [D] Sources of esoteric data? Specifically looking for 6dof motion data from a medium to large oceangoing vessel underway in various sea states.
    I am ok with paying for the data. I just can't find any sources for it. I found some data on github that appears to come from container ships at port, but nothing for a ship underway. submitted by /u/jschall2 [link] [comments]  ( 9 min )
    [D] Adding a modality to a pre-trained model
    Hi, I have a dataset with video and other modalities (e.g. audio), and I want to run a captioning task. I found UniVL, which is a pre-trained model that supports video and text (transcripts) and can caption them. It extracts features and runs transformer encoders on both these modalities to get an embedding, then concatenates them and feeds it into a cross-encoder and decoder to get captions. I'm wondering if I can make use of this model, but add in other modalities, by writing my own embedding model and feeding the embeddings into the cross encoder. Would this work? Is there any similar previous work regarding adding new modalities to a pre-trained network? submitted by /u/joeswansonx69x [link] [comments]  ( 9 min )
    [D] What is the current SOTA of Neural Architecture Search (NAS)?
    I've seen classic papers before 2021 that have been quite influential - RL and evolution based strategies. I have also seen: differentiable approaches: https://arxiv.org/abs/1806.09055 zero-learning approaches: https://arxiv.org/abs/2006.04647 But these are all papers pre-2021. From people who are familiar with this field, what is the current SOTA of neural architecture search (NAS) post 2022? i.e. papers that can serve as the most relevant baselines? Thank you! :) ​ ​ ​ submitted by /u/Cultural-Average3959 [link] [comments]  ( 9 min )
    [D] Is active learning a dying field in industry, given the development in few shot/zero shot learning?
    Is active learning a dying topic now that zero-shot learning has come out? Active learning uses a few labeled samples plus an initially trained model to select the most useful unlabeled data for training. Zero/few-shot learning trains a model on some data and then makes it work directly with unseen labels/data. In my understanding, zero/few-shot learning is more aligned with the current large model or foundation model trend. Active learning strategies seem to still rely on small datasets and aim to gradually enrich the training data by selecting new samples. In industry and in big tech, which one is more used or deployed? Can anyone give me some comments? submitted by /u/Little-Bumblebee-452 [link] [comments]  ( 9 min )
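    For readers unfamiliar with the selection step mentioned in this post, one common active learning strategy is uncertainty sampling: rank unlabeled samples by the entropy of the current model's predicted class probabilities and send the most uncertain ones for labeling. The sketch below assumes the probabilities are already available; `select_for_labeling` and the sample pool are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_for_labeling(pool_probs, budget):
    """Indices of the `budget` unlabeled samples the model is least sure about."""
    ranked = sorted(range(len(pool_probs)),
                    key=lambda i: entropy(pool_probs[i]),
                    reverse=True)
    return ranked[:budget]


# Predicted class probabilities for three unlabeled samples (made-up values).
pool = [[0.98, 0.02], [0.50, 0.50], [0.70, 0.30]]
chosen = select_for_labeling(pool, budget=2)
print(chosen)
```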
    [R] Decoding LLM Uncertainties for Better Predictability
    Hi all, Building off our last research post, we wanted to figure out ways to quantify "ambiguity" and "uncertainty" in prompts/responses to LLMs. We ended up discovering two useful forms of uncertainty: "Structural" and "Conceptual" uncertainty. In a nutshell: Conceptual uncertainty is when the model isn't sure what to say, and Structural uncertainty is when the model isn't sure how to say it. You can play around with this yourself in the demo or read about it in more detail in the blog post submitted by /u/shayanjm [link] [comments]  ( 9 min )
    [D] For large datasets, is your data selection process limiting model performance?
    I often hear from folks with very large datasets saying: “my labelling costs keep increasing, but we don’t see model performance improvements” or “my storage and compute costs are rising (for a dataset of 1M+ images) but performance just stalled”. This post argues that large datasets have hidden costs, beyond time and money, poor data quality and the wrong selection process might be killing model performance. Any thoughts? Have you faced this challenge? submitted by /u/btcmx [link] [comments]  ( 9 min )
    [P] SemanticSearch for PDF mining
    Hello, everyone! I'm seeking tips to enhance my semantic search pipeline. Currently, I'm working on a semantic search tool. Given a set of text files, my goal is to retrieve the most relevant information related to the query. To achieve this, I begin by preprocessing the PDF files, splitting them into pages, and computing embeddings using a fine-tuned BERT model for Italian. Next, with a query and its embedding, I calculate the cosine similarity to all the pages in the document. Since there aren't many pages, a brute search remains quite fast. However, I'm encountering an issue where the similarity results don't consistently yield the most relevant information. I've experimented with various embedding layers, but there's been little to no improvement. I've also tested a commercially available solution to ensure the problem isn't with my PDF files. Interestingly, I achieved better results, leading me to believe that the issue may lie within my pipeline. My current hypothesis is that the page splitting process might be excluding relevant semantic connections, and I may need to improve my text preprocessing. What suggestions do you have to enhance my results? P.S. The information obtained from the similarity check is subsequently used as context with a chat language model, similar to tools like AsMyPdf. submitted by /u/AcquaFisc [link] [comments]  ( 9 min )
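    One way to test the hypothesis in this post, that hard page boundaries cut semantic connections, is to split the text into overlapping word windows instead of pages and rank those by cosine similarity. The sketch below is illustrative only: the embedding model is left out, vectors are assumed to be plain lists of floats, and `chunk_words`, `cosine`, `top_k`, and the window sizes are hypothetical choices.

```python
import math


def chunk_words(text, size=50, overlap=10):
    """Split text into windows of `size` words that share `overlap` words."""
    step = size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k(query_vec, chunk_vecs, k=3):
    """Indices of the k chunks most similar to the query, best first."""
    order = sorted(range(len(chunk_vecs)),
                   key=lambda i: cosine(query_vec, chunk_vecs[i]),
                   reverse=True)
    return order[:k]


chunks = chunk_words("a b c d e f g h i j", size=4, overlap=2)
best = top_k([1.0, 0.0], [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]], k=2)
print(chunks, best)
```

    With overlapping windows, a sentence that straddles a page break appears intact in at least one chunk, which is the property the page-level split lacks.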
    [D] Good compression algo to compress model checkpoints?
    I have a couple of terabytes of checkpoints, and I desperately need to free up some space, without deleting those atm. Is there a compression algorithm that can handle such data successfully? I tried gzip with tar but the compressed size ended up being only ~100G less - that's when I realized that (gzip) compression algo is not good at handling seemingly random numerical data. Do you know of methods that've proven to work in this scenario? submitted by /u/OpeningVariable [link] [comments]  ( 9 min )
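    A technique sometimes applied to floating-point arrays before a general-purpose compressor is byte shuffling, as done by the shuffle filters in Blosc and HDF5: the first bytes of all values are grouped together, then the second bytes, and so on, so that the more regular exponent bytes sit next to each other. The sketch below uses only the standard library and synthetic float32 values; whether it helps on a real checkpoint is an assumption that has to be measured, and the helper names are hypothetical.

```python
import random
import struct
import zlib


def shuffle_bytes(raw, itemsize=4):
    """Group byte 0 of every value, then byte 1, and so on (len(raw) % itemsize == 0)."""
    return b"".join(raw[i::itemsize] for i in range(itemsize))


def unshuffle_bytes(shuffled, itemsize=4):
    """Inverse of shuffle_bytes."""
    count = len(shuffled) // itemsize
    out = bytearray(len(shuffled))
    for i in range(itemsize):
        out[i::itemsize] = shuffled[i * count:(i + 1) * count]
    return bytes(out)


# Synthetic stand-in for a weight tensor: small Gaussian float32 values.
random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(10_000)]
raw = struct.pack(f"<{len(weights)}f", *weights)

plain_size = len(zlib.compress(raw, 9))
shuffled_blob = zlib.compress(shuffle_bytes(raw), 9)
restored = unshuffle_bytes(zlib.decompress(shuffled_blob))
print(plain_size, len(shuffled_blob))  # compare the two sizes on your own data
```

    The transform is lossless, so the comparison costs nothing but time. For files of this size, a chunked or streaming version would be needed rather than loading everything into memory.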
    [R] Think before you speak: Training Language Models With Pause Tokens
    https://arxiv.org/pdf/2310.02226.pdf Abstract Language models generate responses by producing a series of tokens in immediate succession: the (K+1)th token is an outcome of manipulating K hidden vectors per layer, one vector per preceding token. What if instead we were to let the model manipulate say, K+10 hidden vectors, before it outputs the (K+1)th token? We operationalize this idea by performing training and inference on language models with a (learnable) pause token, a sequence of which is appended to the input prefix. We then delay extracting the model's outputs until the last pause token is seen, thereby allowing the model to process extra computation before committing to an answer. We empirically evaluate pause-training on decoder-only models of 1B and 130M parameters with causal pretraining on C4, and on downstream tasks covering reasoning, question-answering, general understanding and fact recall. Our main finding is that inference-time delays show gains when the model is both pre-trained and finetuned with delays. For the 1B model, we witness gains on 8 of 9 tasks, most prominently, a gain of 18% EM score on the QA task of SQuAD, 8% on CommonSenseQA and 1% accuracy on the reasoning task of GSM8k. Our work raises a range of conceptual and practical future research questions on making delayed next-token prediction a widely applicable new paradigm. Here is a Medium post about my thoughts on the paper. submitted by /u/transformer_ML [link] [comments]  ( 9 min )
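    The input manipulation described in the abstract, appending a run of pause tokens to the prefix and reading the model's output only from the last pause position onward, can be illustrated at the token-list level. This is not the authors' implementation: the token string `<pause>`, the helper names, and the default of 10 pauses are assumptions, and no model is involved.

```python
PAUSE_TOKEN = "<pause>"  # hypothetical token string; the paper learns its embedding


def append_pause_tokens(prefix_tokens, num_pauses=10):
    """Return the prefix followed by `num_pauses` pause tokens."""
    return list(prefix_tokens) + [PAUSE_TOKEN] * num_pauses


def answer_start_index(padded_tokens):
    """Position of the last pause token; outputs before it would be ignored."""
    reversed_offset = padded_tokens[::-1].index(PAUSE_TOKEN)
    return len(padded_tokens) - 1 - reversed_offset


prefix = ["What", "is", "2+2", "?"]          # K = 4 prefix tokens
padded = append_pause_tokens(prefix, num_pauses=10)
start = answer_start_index(padded)
print(len(padded), start)
```

    With K prefix tokens and 10 pauses, the model has K+10 positions to work with before its first extracted output, which matches the "K+10 hidden vectors" framing in the abstract.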
    [D] Can Direct Preference Optimization (DPO) be used to replace any type of RL for LLMs, or is it better suited for just scenarios like RLHF?
    DPO Paper I read a really fascinating paper where RL was used on LLMs to make them better at interacting in embodied environments. https://arxiv.org/abs/2310.08588 The technique was called Reinforcement Learning with Environmental Feedback (RLEF). In the paper PPO was used, but I'm wondering if DPO could be used to replace it? submitted by /u/30299578815310 [link] [comments]  ( 9 min )
    How to create dataset for training generative chatbot model? [D]
    i built my own custom generative ai chatbot model. the only thing i need is a high-quality and diverse dataset to train my model. i can't use already existing datasets because i don't think they are diverse or high-quality enough, so i need to create one using gpt4. my dataset will have 3 columns: system_prompt, input, output. but i'm not very experienced at creating datasets, and i couldn't find any resources about this. the input, output and system prompt should all be created by gpt4. how can i do it? and what is the most effective way to use the api for this? submitted by /u/Many-Corner-6700 [link] [comments]  ( 9 min )
    [P] MergeLlama-7b - A fine tune of CodeLlama for resolving merge conflicts
    Merge conflicts are something that give developers hours of headaches and I figured I would try and give my take on a solution. I followed a paper from IEEE engineers in 2022 who trained CodeBert on merge conflicts as a classification task, and they published their dataset for public use. Input formatted as “>>>>>>” will output the attempted conflict resolution. I am still trying to find out how to do evaluations on this model as the loss applies to all sections not just the resolution, and the TRL Trainer with a data collator gives NaN as a loss. The model and dataset are on HuggingFace under codys12/MergeLlama and codys12/MergeLlama-7b. Any feedback is appreciated! submitted by /u/cstein123 [link] [comments]  ( 9 min )
    [P] OpenLLMetry, a way to get complete visibility into RAG pipelines with your existing tools
    Hey, I've built a set of extensions for OpenTelemetry that provides visibility into LLM applications like RAG pipelines - whether it be prompts, vector DBs and more. Here’s the repo: https://github.com/traceloop/openllmetry. Two key benefits with OpenTelemetry are - You can trace your entire system execution, not just the LLM (so you can see how requests to DBs, or other calls affect the overall result); You can connect to any monitoring platform—no need to adopt new tools. Install the SDK and plug it into Datadog, Sentry, or both. Or switch between them easily. There's already support for OpenAI, Anthropic, Cohere, Pinecone, Chroma, LangChain, and Haystack and we are working hard to support the entire ecosystem. Would love to hear your thoughts submitted by /u/nirga [link] [comments]  ( 9 min )
    Can AI Replace Developers? Princeton and University of Chicago's SWE-bench Tests AI on Real Coding Issues [N]
    Exploiting AI to make software programming easier? SWE-bench, a unique evaluation system, tests language models' ability to solve real GitHub-collated programming issues. Interestingly, even top-notch models manage only the simplest problems, underscoring tech development's urgency for providing practical software engineering solutions. For the latest advancements in AI, look here first. https://preview.redd.it/rq5vl22bckub1.png?width=1292&format=png&auto=webp&s=d79988bfe0ab37b0f97f55296d7a7341c9292c11 A New Approach to Evaluating AI Models Researchers use real-world software engineering problems from GitHub to assess language models' coding problem-solving skills. SWE-bench, introduced by Princeton and the University of Chicago, offers a more comprehensive and challenging benchmark…  ( 9 min )
    How to design a Chat-GPT or Bard-like large scale app with your own foundational model? [D]
    I am just puzzled how does one efficiently query a huge transformer model such that so many users can be served at the same time. Is it queried on per user basis? (modulo some caching) If yes, how expensive is this? If no, what the hell is going on? :D Are there any good resources on this? (how to build large scale apps with big models, from scratch). Somehow this doesn't really fit the standard data-intensive system design process, or maybe I am missing something. submitted by /u/jimmymvp [link] [comments]  ( 9 min )
  • Open

    Taken on my screen, but I can’t get over what it has become. I’m obsessed with AI.
    submitted by /u/Prestigious_Rough704 [link] [comments]
    I built an AI tool to help authors create webcomics
    I always did want to draw a comic but I was never very good at drawing even though I put a lot of effort into it when I was younger... :'( So when I stumbled on image generation AI, I thought maybe it could help me transform my doodles into something decent. It took me a while and a lot of effort to write a tool to help me with that : story and dialogues are my own, images are based on doodles enhanced by AI. I would love to have feedback about the story : https://stripik.com/story/4/chapter/4/ ​ https://preview.redd.it/dvcudd4j3mub1.png?width=800&format=png&auto=webp&s=717bef60eaaf9b9a35a1a66f266c374406a923fa submitted by /u/maxcmoi [link] [comments]
    I'm chronicling the process of trying to create a boardgame with Chat GPT and it's amazing just how great of an assistant it is!
    submitted by /u/SexyJimBelushi [link] [comments]
    If SEO tools were Nintendo 3DS games [Powered by AI]
    Did you play these (SEO) games? 👾 https://preview.redd.it/yxuzllzupkub1.jpg?width=661&format=pjpg&auto=webp&s=23ebc6e972ac85b152aa8b69f48e2b0c5bae2c76 https://preview.redd.it/x8zfokzupkub1.jpg?width=661&format=pjpg&auto=webp&s=be2163a7bfbeee64a63c1292a5b4c482c5be33ae https://preview.redd.it/eerpgnzupkub1.jpg?width=661&format=pjpg&auto=webp&s=d8eceafd3732653c743a6731ae5932c9e0da071c https://preview.redd.it/uxwgskzupkub1.jpg?width=661&format=pjpg&auto=webp&s=07c751eaa16f8fa484034c98a3c1fd0b2162f5a2 Source: https://twitter.com/carlos_darko/status/1713900305765605484 submitted by /u/DanielPeris [link] [comments]
    Can AI Replace Developers? Princeton and University of Chicago's SWE-bench Tests AI on Real Coding Issues
    Exploiting AI to make software programming easier? SWE-bench, a unique evaluation system, tests language models' ability to solve real GitHub-collated programming issues. Interestingly, even top-notch models manage only the simplest problems, underscoring tech development's urgency for providing practical software engineering solutions. For the latest advancements in AI, look here first. https://preview.redd.it/8laeg7cbckub1.png?width=1292&format=png&auto=webp&s=e549f0045a7253cd2d3f351d8297a301c4cbf6ac A New Approach to Evaluating AI Models Researchers use real-world software engineering problems from GitHub to assess language models' coding problem-solving skills. SWE-bench, introduced by Princeton and the University of Chicago, offers a more comprehensive and challenging benchmark…
    Deep fake language change
    What is the best free tool to make a video where the language changes? submitted by /u/Easy_Technology6768 [link] [comments]
    One-Minute Daily AI News 10/15/2023
    New York-based tech firms and investors see the advent of AI as the latest opportunity to try to unseat the Bay Area as tech’s global capital.[1] Microsoft announced a new “bug bounty” program, vowing to reward security researchers between $2,000 and $15,000 if they’re able to find “vulnerabilities” in its Bing AI products, including “jailbreak” prompts that make it produce responses that go against the guardrails that are supposed to bar it from being bigoted or otherwise problematic.[2] OpenAI is preparing to launch a suite of updates to make it more cost-effective and efficient for developers to create software applications with AI models.[3] TCS Seeks to Use Microsoft AI Partnership to Improve Margins.[4] Sources: [1] https://www.axios.com/2023/10/12/new-york-ai-world-capital [2] https://futurism.com/the-byte/microsoft-bing-ai-bug-bounty [3] https://www.techedt.com/openai-aims-to-attract-developers-with-cost-effective-updates-insiders-reveal [4] https://www.bloomberg.com/news/articles/2023-10-15/tcs-seeks-to-use-microsoft-ai-partnership-to-improve-margins#xj4y7vzkg submitted by /u/Excellent-Target-847 [link] [comments]
    Are there any image generators that can generate the same image you upload to it, but from a different hypothetical angle?
    I was wondering if any AI image generation was good at this (yet?). I have a real-life image I want to upload and get AI to generate what that would most likely look like from the vantage point of someone standing at a different angle. submitted by /u/YepperyYepstein [link] [comments]
    AI dubbing ( local )
    Hi there, does anybody know how AI dubbing translators work? I'm interested in whether something similar to https://app.rask.ai/ exists locally. Is there anything on GitHub? I'm looking for the Czech language. I know you can transcribe audio to text, then translate the text and let an AI speak it. But is there a tool that does all of this in one click? Thank you and have a nice day. submitted by /u/Low_Government_681 [link] [comments]
  • Open

    A method to interpret AI might not be so interpretable after all
    Some researchers see formal specifications as a way for autonomous systems to "explain themselves" to humans. But a new study finds that we aren't understanding.  ( 9 min )
  • Open

    How Veriff decreased deployment time by 80% using Amazon SageMaker multi-model endpoints
    Veriff is an identity verification platform partner for innovative growth-driven organizations, including pioneers in financial services, FinTech, crypto, gaming, mobility, and online marketplaces. In this post, we show you how Veriff standardized their model deployment workflow using Amazon SageMaker, reducing costs and development time.  ( 8 min )
  • Open

    Improving traffic evacuations: A case study
    Posted by Damien Pierce, Software Engineer, and John Anderson, Senior Research Director, Google Research Some cities or communities develop an evacuation plan to be used in case of an emergency. There are a number of reasons why city officials might enact their plan, a primary one being a natural disaster, such as a tornado, flood, or wildfire. An evacuation plan can help the community more effectively respond to an emergency, and so could help save lives. However, it can be difficult for a city to evaluate such a plan because it is not practical to have an entire town or city rehearse a full blown evacuation. For example, Mill Valley, a city in northern California, created a wildfire evacuation plan but lacked an estimate for how long the evacuation would take. Today we describe a c…  ( 94 min )
  • Open

    DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models
    How trustworthy are generative pre-trained transformer (GPT) models? To answer this question, University of Illinois Urbana-Champaign, together with Stanford University, University of California, Berkeley, Center for AI Safety, and Microsoft Research, released a comprehensive trustworthiness evaluation platform for large language models (LLMs), which is presented in the recent paper: DecodingTrust: A Comprehensive Assessment of Trustworthiness […] The post DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models appeared first on Microsoft Research.  ( 11 min )
  • Open

    Explainable Artificial Intelligence (XAI) for AI & ML Engineers
    Introduction Hello AI & ML Engineers, as you all know, Artificial Intelligence (AI) and Machine Learning Engineering are among the fastest growing fields, and almost all industries are adopting them to enhance and expedite their business decisions and needs. To that end, they are working on various aspects and preparing the data for the AI/ML platform with the help of SMEs… The post Explainable Artificial Intelligence (XAI) for AI & ML Engineers appeared first on Data Science Central.  ( 23 min )
  • Open

    Nearest, easiest, and most accessible
    From Love What Lasts, Joshua Gibbs: … there are too many things in the world to care equally about them all. The sheer volume of things … demands that we have hierarchical standards by which to judge their value, or else we are condemned to give our lives over entirely to what is nearest, easiest, […] Nearest, easiest, and most accessible first appeared on John D. Cook.  ( 4 min )
  • Open

    Benchmarking Bit Errors in Quantized Neural Networks with PyTorch
    Similar to my article series on adversarial robustness, I was planning to have a series on bit errors robustness accompanied by PyTorch code. Instead, due to time constraints, I decided to condense the information into a single article. The code for the originally planned six articles is available on GitHub. The post Benchmarking Bit Errors in Quantized Neural Networks with PyTorch appeared first on David Stutz.  ( 6 min )
  • Open

    Rethinking the Role of PPO in RLHF
    Rethinking the Role of PPO in RLHF TL;DR: In RLHF, there’s tension between the reward learning phase, which uses human preference in the form of comparisons, and the RL fine-tuning phase, which optimizes a single, non-comparative reward. What if we performed RL in a comparative way? Figure 1: This diagram illustrates the difference between reinforcement learning from absolute feedback and relative feedback. By incorporating a new component - pairwise policy gradient, we can unify the reward modeling stage and RL stage, enabling direct updates based on pairwise responses. Large Language Models (LLMs) have powered increasingly capable virtual assistants, such as GPT-4, Claude-2, Bard and Bing Chat. These systems can respond to complex user queries, write code, and even produce poetry. T…  ( 6 min )

  • Open

    Johnson circle theorem
    Draw three circles of radius r that intersect at a single point. Then draw a triangle connecting the remaining three points of intersection. (Each pair of circles intersects in two points, one of which is the point where all three circles intersect, so there are three other intersection points.) Then the circumcircle of the triangle, […] Johnson circle theorem first appeared on John D. Cook.  ( 5 min )
  • Open

    NVIDIA Blackwell B100 GPUs To Feature SK Hynix HBM3e Memory, Launches In Q2 2024 Due To Rise In AI Demand
    submitted by /u/norcalnatv [link] [comments]
    Researchers propose GameGPT: A multi-agent approach to fully automated game development
    Game dev is super complex nowadays - games have huge codebases, massive teams, and dev cycles dragging on for years. Costs are insane too - budgets can hit $100M+ easily. In a new paper, researchers propose to reverse this trend with an AI framework called GameGPT that automates parts of the dev process using multiple AI agents. Each agent handles a different role (all are fine-tuned from relevant base models): One agent reviews the game design plan to catch errors Another turns tasks into code implementations Reviewer agents check the code and results A testing agent validates everything works as expected By breaking up the workflow, GameGPT can simplify things for the AI agents. They just focus on a narrow role versus having one jack-of-all-trades agent. The authors argue GameGPT can eliminate repetitive and rote elements of gamedev like testing. This would free up developers to focus on creative design challenges. However, the GameGPT paper does not include any concrete results or experiments demonstrating improved performance. There is no evidence presented that GameGPT reduces hallucinations, redundancy or development time. The authors mention empirical results support their claims that the architecture is more effective, but none are provided. I could not find any additional support material about this work, like a project website, that I could use to further check into this (maybe someone can share in the comments?). Right now GameGPT seems mostly conceptual. The ideas are interesting but hard to assess without quantitative results. TLDR: New GameGPT AI framework aims to automate tedious parts of game development using specialized agents. No concrete results were provided in the paper - someone will need to test this out and report back. Full summary here. Paper is here. submitted by /u/Successful-Western27 [link] [comments]
    Speech Condenser: An Advanced On-Premise Pipeline Tool for Streamlining and Summarizing Dialogues from Videos
    submitted by /u/nez_har [link] [comments]
    Best way to produce consistent images?
    Hi! I'm trying to jazz up my design portfolio for applying for jobs, and I wanted to insert some cute illustrations on each project page. The projects deal with a variety of topics so I'll need pictures of many things, but want to keep the style quite consistent. What is the best AI tool right now to do this? I paid for Midjourney but I can't seem to understand how to get it to do this. For example I got this image from DALLE and love the style, the white background also helps make it look better on the portfolio. I'd want another image in the same style of two kids throwing a ball, but can't figure out how to do it. Alternatively if I could upload this image to an AI and say "in the same style, generate..." that would be great too. Thank you! submitted by /u/_Dip_ [link] [comments]
    Messi vs Ronaldo | Freestyle Rap Song | AI Rap Song | Tell your opinion on this video
    submitted by /u/Agitated-Spell3979 [link] [comments]
    Seeking Your Feedback on a new community around Open-Source AI Code Generation Models
    Currently, we are building a community that is specifically dedicated to Open-Source AI Code Generation Models. Our aim is to create a thriving ecosystem where developers, enthusiasts, and experts can come together to drive innovation, share insights, and promote a collaborative approach to AI code generation. I wanted to provide you with an overview of the key features we're integrating into this community: 1. Collaboration: A dedicated space where enthusiasts and experts alike can collaborate on projects, share their findings, and work on enhancing existing models. 2. Discussion: Whether through forums or chat platforms, we aim to foster discussions around the challenges, breakthroughs, and best practices in the realm of AI code generation. 3. Resource Sharing: Our community will feature a repository/platform for members to freely share and access open-source models, datasets, and other essential tools. With your experience and insight into the AI domain, we would greatly appreciate your feedback on the following:- - Do you believe such a community would be valuable to you personally or to the wider developer community? - Would you consider becoming a part of such a community? - You are already a part of such a community and this one might not be of much value to you? - Any other suggestions or feedback? Your candid feedback on this idea, its potential impact, and any suggestions you might have will be invaluable to us as we continue shaping this community's structure and offerings. submitted by /u/akanshtyagi [link] [comments]
    Biden eyes adding AI chip curbs to Chinese companies abroad
    The Biden administration is considering closing a loophole that gives Chinese companies access to American artificial intelligence (AI) chips through units located overseas. The United States previously restricted shipments of AI chips to China but left overseas subsidiaries of Chinese companies with unfettered access. The Biden administration is now looking for ways to close this loophole and prevent China from accessing top AI technology. However, it is challenging to plug every gap in export controls. Chinese firms are purchasing chips for use in data centers abroad, and it is difficult for the United States to police those transactions. The United States has been seeking to halt the rise of China's AI capability, which depends on its access to U.S. chips. Washington has been working to close other loopholes that allow AI chips into China, and the new rules expected this month will likely apply those same restrictions more broadly to all companies in the market. The U.S. government is also grappling with the issue of Chinese parties accessing U.S. cloud providers like Amazon Web Services. Overall, the Biden administration is facing challenges in cutting China off from top AI technology and closing all loopholes in export controls. Source : https://www.reuters.com/technology/biden-eyes-adding-ai-chip-curbs-chinese-companies-abroad-2023-10-13/ submitted by /u/NuseAI [link] [comments]
  • Open

    SOTA Facial Recognition [D]
    I want to sort folders of pictures of people by their similarity to an input photo. I managed to use DeepFace, but I'm wondering if anyone knows a better method? submitted by /u/RedditAlreaddit [link] [comments]  ( 9 min )
    [R] Reason for Future, Act for Now: A Principled Framework for Autonomous LLM Agents with Provable Sample Efficiency
    Paper: https://arxiv.org/abs/2309.17382 Project page: https://agentification.github.io/RAFA Code: https://github.com/agentification/RAFA_code Reason for future, act for now (RAFA) TL;DR: - The first autonomous LLM agent RAFA with provable regret guarantees and outstanding empirical performances. - SOTA results on Game of 24, ALFWorld, BlocksWorld, and Tic-Tac-Toe. submitted by /u/WolverineUnable5957 [link] [comments]  ( 9 min )
    [P] Machine Learning Algorithm from Scratch
    submitted by /u/shaongit [link] [comments]  ( 8 min )
    [D]Was any further work done on the paper "Large-Scale Study of Curiosity-Driven Learning" in recent years?
    So, a few weeks ago, I got interested in the exploration problem in Reinforcement Learning and came across this amazing paper. Just wanted to know if any of you came across any paper which explores this idea more or takes it forward. Thanks in advance. submitted by /u/Interesting-Weeb-699 [link] [comments]  ( 9 min )
    [R] tool to brainstorm novel ideas
    Hey folks, I developed a research tool https://idea-factory.ngrok.dev/ (Login: [temp@holistic-intelligence.net](mailto:temp@holistic-intelligence.net) Password: noidea) to identify novel research problems grounded in the scientific literature. Given an idea that intrigues you, the tool identifies the most relevant pieces of literature, creates a brief summary, and provides three possible extensions of your idea. I would be happy to get your feedback on the usefulness of them. Thank you in advance! submitted by /u/Ma7dy [link] [comments]  ( 9 min )
    [P] Oddly Satisfying Animation of Pixel Shuffle
    submitted by /u/Animated-AI [link] [comments]  ( 8 min )
    [D] Pipeline for data processing in time series forecasting?
    What is the correct pipeline for data processing when conducting time series forecasting? Should we begin with data normalization/standardization, followed by feature selection, and then split the data into training, validation, and test sets? Or is it advisable to initially split the data to prevent spill-over effects? I'm concerned about the possibility of training my model on (part of) the test data, which could result in spill-over effects. However, if the recommended approach is to split the data first and then perform normalization and feature selection, what impact would this have on the selected features? Does the manner in which we split the data into random time periods matter, or is it necessary to incorporate a validation method that accounts for temporal effects? I'm worried that the selected features might depend on the time period I choose for my training and test sets. What is the best practice in this scenario? submitted by /u/Ambitious-Pay6329 [link] [comments]  ( 9 min )
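One common way to avoid the spill-over the post worries about is to split chronologically first and then fit any scaler on the training slice only. The sketch below uses plain Python and a made-up trending series; the helper names (`chronological_split`, `fit_standardizer`, `transform`) and the 60/20/20 ratios are assumptions for illustration, not an established recipe.

```python
from statistics import mean, pstdev

def chronological_split(series, train_frac=0.6, val_frac=0.2):
    """Split by time order (no shuffling) so later data never leaks into training."""
    n = len(series)
    train_end = int(n * train_frac)
    val_end = int(n * (train_frac + val_frac))
    return series[:train_end], series[train_end:val_end], series[val_end:]

def fit_standardizer(train):
    """Estimate mean and standard deviation on the training slice only."""
    mu = mean(train)
    sigma = pstdev(train) or 1.0  # guard against a constant series
    return mu, sigma

def transform(values, mu, sigma):
    """Apply statistics learned from the training slice to any slice."""
    return [(v - mu) / sigma for v in values]

series = [float(t) for t in range(100)]  # hypothetical upward-trending series
train, val, test = chronological_split(series)
mu, sigma = fit_standardizer(train)
train_z, val_z, test_z = (transform(part, mu, sigma) for part in (train, val, test))
```

Feature selection would be fitted on the training slice in the same way. Because the series trends upward, the scaled test values fall outside the training range, which is the kind of temporal effect a random split would hide.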
    [R] Researchers propose GameGPT: A multi-agent approach to fully automated game development
    Game dev is super complex nowadays - games have huge codebases, massive teams, and dev cycles dragging on for years. Costs are insane too - budgets can hit $100M+ easily. In a new paper, researchers propose to reverse this trend with an AI framework called GameGPT that automates parts of the dev process using multiple AI agents. Each agent handles a different role (all are fine-tuned from relevant base models): One agent reviews the game design plan to catch errors Another turns tasks into code implementations Reviewer agents check the code and results A testing agent validates everything works as expected By breaking up the workflow, GameGPT can simplify things for the AI agents. They just focus on a narrow role versus having one jack-of-all-trades agent. The authors argue GameGPT can eliminate repetitive and rote elements of gamedev like testing. This would free up developers to focus on creative design challenges. However, the GameGPT paper does not include any concrete results or experiments demonstrating improved performance. There is no evidence presented that GameGPT reduces hallucinations, redundancy or development time. The authors mention empirical results support their claims that the architecture is more effective, but none are provided. I could not find any additional support material about this work, like a project website, that I could use to further check into this (maybe someone can share in the comments?). Right now GameGPT seems mostly conceptual. The ideas are interesting but hard to assess without quantitative results. TLDR: New GameGPT AI framework aims to automate tedious parts of game development using specialized agents. No concrete results were provided in the paper - someone will need to test this out and report back. Full summary here. Paper is here. submitted by /u/Successful-Western27 [link] [comments]  ( 9 min )
    [D] Generate audio samples based on prompt sample
    Hi, I would like to create a system that generates different audio samples based on an audio sample prompt. Does anyone know whether such a project or similar ideas have already been implemented? Or any suggestions on what to read in order to realize such a project? I have knowledge of ML programming and Python audio generation. submitted by /u/busconw [link] [comments]  ( 9 min )
    [D] Getting bad MFUs, what can I do to make it better
    Hi, so I've been working with NanoGPT, finetuning GPT-2, and I'm getting terrible MFUs: the 5 warmup steps report -100%, and normal steps have an MFU of around 3-4%. Most runs I hear of have an MFU of around 45%. How do I get this better? Colab -> https://colab.research.google.com/drive/1gvTsyjxHiDkKHFsnWWouzr1xJWW23BA3?usp=sharing Code -> https://github.com/VatsaDev/NanoPhi2 submitted by /u/vatsadev [link] [comments]  ( 9 min )
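For context, MFU (model FLOPs utilization) is the achieved FLOP/s divided by the hardware's peak FLOP/s. The sketch below uses the common estimate of about 6 FLOPs per parameter per token plus an attention term, which is, as far as I recall, the shape of the estimate nanoGPT uses. The function name `estimate_mfu_sketch`, the GPT-2-small shape, the timing, and both peak figures are assumptions for illustration only.

```python
def estimate_mfu_sketch(n_params, n_layer, n_head, head_dim, seq_len,
                        tokens_per_iter, seconds_per_iter, peak_flops):
    """Rough MFU: achieved FLOP/s divided by the device's peak FLOP/s."""
    # ~6 FLOPs per parameter per token (forward + backward),
    # plus an attention term of 12 * layers * heads * head_dim * seq_len per token.
    flops_per_token = 6 * n_params + 12 * n_layer * n_head * head_dim * seq_len
    flops_per_iter = flops_per_token * tokens_per_iter
    achieved = flops_per_iter / seconds_per_iter
    return achieved / peak_flops

# Assumed GPT-2-small-like shape and a made-up timing of 2 s per iteration.
shape = dict(n_params=124e6, n_layer=12, n_head=12, head_dim=64, seq_len=1024)
run = dict(tokens_per_iter=8 * 1024, seconds_per_iter=2.0)

# Same run scored against two assumed peak figures (large vs. smaller GPU).
mfu_vs_large_peak = estimate_mfu_sketch(**shape, **run, peak_flops=312e12)
mfu_vs_small_peak = estimate_mfu_sketch(**shape, **run, peak_flops=65e12)
print(f"{mfu_vs_large_peak:.2%} vs {mfu_vs_small_peak:.2%}")
```

The same run scores several times lower against the larger peak, so a peak constant that does not match the actual GPU would by itself make MFU look poor. The -100% during warmup is likely a placeholder printed before a running estimate exists rather than a measurement.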
    [D] Check out my latest article on how the new improvements in GPT-4V(ision) can bring on a new era of computer vision models, fine-tuned on outputs of GPT-4V(ision).
    https://medium.com/@rishiswethan.c.r/how-gpt-4v-ision-will-revolutionise-image-annotation-b0d3ace64bff?source=friends_link&sk=4be42541a8a8ee40e18ef14533342cfd submitted by /u/Remarkable_Seesaw_89 [link] [comments]  ( 8 min )
    How to do object detection in Unity: any good resources? [D]
    I have tried Barracuda and Vuforia, and neither works for some reason. I'm completely lost at the moment. It's an object detection model that detects circuit schematic symbols using computer vision. submitted by /u/PreferenceFrosty2958 [link] [comments]  ( 9 min )
    [D] Running Large Language Models on CPU
    Fine-tuning large language models with the aim of obtaining a small but accurate model is extremely difficult, because you have to strike a balance between the model's size and its accuracy. Researchers from IST Austria & Neural Magic seem to have found a sweet spot. In their latest paper, they successfully applied sparse fine-tuning to MPT with remarkable performance. The MPT model was pruned to 75% sparsity without a drop in accuracy, showing performance that is on par with quantization approaches. In particular, the resulting sparse model can execute fast on CPUs by taking advantage of the sparsity. Instead of performing standard loss-based fine-tuning, which may fail to recover accuracy, the researchers experiment with distillation-type losses. These losses are better at recovering accuracy at high sparsity. What's impressive is that the sparse fine-tuned LLM can achieve 7.7 tokens per second on a single core and 26.7 tokens per second on 4 cores of a cheap consumer AMD Ryzen CPU. The MPT-7B model was fine-tuned via SFT, obtaining a dense baseline that showed remarkable performance. This baseline was later pruned with SparseGPT to between 40% and 80% sparsity, reaching 5X compression ratios. By applying SquareHead KD, FP32 models with 75% sparsity can be obtained with NO accuracy loss, outperforming cross-entropy and other KD methods. The paper is available on arXiv. Sparse Finetuning for Inference Acceleration of Large Language Models: https://huggingface.co/papers/2310.06927 MPT Sparse Finetuned on GSM8k with DeepSparse Hugging Face Space: https://huggingface.co/spaces/neuralmagic/sparse-mpt-7b-gsm8k submitted by /u/mwitiderrick [link] [comments]  ( 9 min )
    [P] I built an AI Writing Coach to proofread your work
    submitted by /u/hungryillini [link] [comments]  ( 8 min )
    [R] Conceptual Framework for Autonomous Cognitive Entities - Clemson University 2023 - Introducing the ACE Framework
    Paper: https://arxiv.org/abs/2310.06775 GitHub: https://github.com/daveshap/ACE_Framework Blog post: https://medium.com/@dave-shap/autonomous-agents-are-here-introducing-the-ace-framework-a180af15d57c Abstract: The rapid development and adoption of Generative AI (GAI) technology in the form of chatbots such as ChatGPT and Claude has greatly increased interest in agentic machines. This paper introduces the Autonomous Cognitive Entity (ACE) model, a novel framework for a cognitive architecture, enabling machines and software agents to operate more independently. Drawing inspiration from the OSI model, the ACE framework presents layers of abstraction to conceptualize artificial cognitive architectures. The model is designed to harness the capabilities of the latest generative AI technologies, including large language models (LLMs) and multimodal generative models (MMMs), to build autonomous, agentic systems. The ACE framework comprises six layers: the Aspirational Layer, Global Strategy, Agent Model, Executive Function, Cognitive Control, and Task Prosecution. Each layer plays a distinct role, ranging from setting the moral compass and strategic thinking to task selection and execution. The ACE framework also incorporates mechanisms for handling failures and adapting actions, thereby enhancing the robustness and flexibility of autonomous agents. This paper introduces the conceptual framework and proposes implementation strategies that have been tested and observed in industry. The goal of this paper is to formalize this framework so as to be more accessible. ​ https://preview.redd.it/7scnwk5a5dub1.png?width=850&format=png&auto=webp&s=371b5b02a453dcad3e70a2600cc2d625eda44133 ​ submitted by /u/Prior-Travel3670 [link] [comments]  ( 9 min )
    [D] Fine tune Llama2 with Lora for foreign language
    Hey folks, I watched a YouTube video, about how some LLMs tokenise languages other than English. For example for the Greek language you will see that this is failing totally, as one character is one token always: ​ https://preview.redd.it/835p97cyhcub1.png?width=1900&format=png&auto=webp&s=944b150cc0fc112cb8cd2bac600f6fcdcc85fb1e My question is, if I would fine-tune it with Alpaca Lora based on Greek text, would the tokeniser change and work properly? Or the fine tune would not work as the tokeniser cannot be retrained/tuned? submitted by /u/kostakos14 [link] [comments]  ( 9 min )
    [D] Advice for applying to undergraduate research internships?
    Hello, I’m a 3rd year data science and linguistics major at a top 30 school looking to land an industry research internship. I’d say I’m fairly competitive: extensive research experience, 2nd author at EMNLP, and an REU at a prestigious institute. I’m already looking at some places such as AI2, but I’m curious if there are other internships I should be aware of. submitted by /u/Kai_151 [link] [comments]  ( 9 min )
    [D] The history of neural networks is over. J. Schmidhuber proposes a giant network that includes all future neural network architectures as subcomponents.
    submitted by /u/fromnighttilldawn [link] [comments]  ( 8 min )
    [P] How do I make my CNN more efficient?
    I've been trying a variety of pre-constructed and self-made U-net-like CNNs and had a few questions: When using torch summary, is there a general formula for estimating a model's inference time/backprop time and required GPU RAM based on the information torch summary gives (Total params, Trainable params, Non-trainable params, Total mult-adds (G), Input size, Forward/backward pass size (MB), Params size (MB), Estimated Total Size (MB)) and other hyper-parameters such as batch size? Why is my self-made model (which has smaller quantities in all the parameters torch summary outputs) requiring more GPU RAM AND taking more time for inference and backprop? Is the coding style for the model's class and its forward prop a huge factor here? If so, could you please provide tips for making my code more efficient? Here's the notebook showcasing a pre-made model from MONAI and two of my self-made models: https://colab.research.google.com/drive/1VRrdnzaAbp25_DtaWTKHW5JxzyhmueMC?usp=sharing I've also listed some of my observations on the models and their results in the notebook. Any ideas or suggestions would be much appreciated. submitted by /u/mimivirus2 [link] [comments]  ( 9 min )
    [P] Made a Python package for creating API endpoints with dynamic queries.
    submitted by /u/squirrels-api [link] [comments]  ( 9 min )
    [R] Supercharging reinforcement learning with logic
    Deep reinforcement learning has led to a variety of compelling results. However, performance issues, particularly relating to the data efficiency of simulation, have limited its applicability in domains where simulations run more slowly. Our solution is to use a logic-based framework, PyReason, as a proxy for the simulation. https://preview.redd.it/kdhpu9qraaub1.png?width=1786&format=png&auto=webp&s=8155ba38fc66bd3a2fe934b1f395351c4db68e2f We showed that inference with a PyReason logic program can provide up to a three order-of-magnitude speedup when compared with native simulations (we studied AFSIM and Starcraft2) while providing comparable reward and win rate (we found that PyReason-trained agents actually performed better than expected in both AFSIM and Starcraft2). https://preview.…  ( 9 min )
    [D] transformers vs llama.cpp vs GPTQ vs GGML vs GGUF
    I am a little puzzled. I know that transformers is the HF framework/library to load, run inference on, and train models easily, and that llama.cpp is another framework/library that does much the same but specializes in quantized models that run on CPU and run much faster. I understand that GGML is a file format for saving model parameters in a single file, that it's an old, problematic format, that GGUF is the new kid on the block, and that GPTQ is the same kind of quantized file format for models that run on GPU. So here is what I can't understand (assuming I got all the rest correct): Does HF Transformers support loading GGUF or GGML models? Does GGUF need a tokenizer JSON, or does that data come from within the GGUF file itself? And is safetensors (another file format) supported by both Transformers and llama.cpp? Since I cannot find Python examples for these combinations, I assume all the answers are no. Can anyone shed some light? submitted by /u/Particular_Flower_12 [link] [comments]  ( 9 min )
  • Open

    Hi everyone, I was following an online RL tutorial that uses Stable Baselines3 and OpenAI's Gym to implement a CartPole environment, but I have run into some problems. Can any of you please help me?
    I was following Nicholas Renotte's RL in 3 hours tutorial and I ran into this issue at timestamp 1:10:00 while testing my trained agent:
    ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (2,) + inhomogeneous part.
    This is my code for testing my environment:
        episodes=5
        for episode in range(1,episodes+1):
            obs=env.reset()
            done=False
            score=0
            print(obs)
            while not done:
                env.render()
                action, _ = model.predict(obs) #Now using model here
                obs, reward, done, truncated, info = env.step(action)
                score += reward
            print('Episode:{} Score:{}'.format(episode,score))
        env.close()
    And this is the environment I am using:
        environment_name = 'CartPole-v1'
        env=gym.make(environment_name,render_mode="human")
    The model variable has my trained model stored in it and is initialized as such:
        model =PPO.load(PPO_Path, env=env)
    The print(obs) call prints this value:
        (array([ 0.03954345, -0.04975226, -0.02942382, -0.02261402], dtype=float32), {})
    I am running this code in a notebook in VS Code on an M2 MacBook running macOS 13.5. I am using Python 3.9.15 and the latest version of all the other libraries and dependencies. Please help submitted by /u/Straight-Knowledge83 [link] [comments]  ( 9 min )
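The printed observation in the post is a tuple `(array, {})`, which suggests that `env.reset()` follows the newer Gym/Gymnasium API and returns `(obs, info)`. Passing that tuple to `model.predict` would plausibly produce the inhomogeneous-shape error. Below is a minimal sketch of the unpacking, using hypothetical stand-ins (`FakeEnv`, `FakeModel`) so that it runs without Gym or Stable-Baselines3 installed; they only mimic the return shapes and are not the real classes.

```python
import random

class FakeEnv:
    """Hypothetical stand-in mimicking the newer reset/step return shapes."""
    def __init__(self, max_steps=10):
        self.max_steps = max_steps
        self.t = 0

    def reset(self):
        self.t = 0
        return [0.0, 0.0, 0.0, 0.0], {}           # (obs, info)

    def step(self, action):
        self.t += 1
        obs = [random.uniform(-0.05, 0.05) for _ in range(4)]
        terminated = False                         # this toy episode never fails
        truncated = self.t >= self.max_steps       # time limit reached
        return obs, 1.0, terminated, truncated, {}

class FakeModel:
    """Hypothetical stand-in for a trained policy's predict()."""
    def predict(self, obs):
        assert len(obs) == 4, "expects the bare observation, not (obs, info)"
        return random.randint(0, 1), None

env, model = FakeEnv(), FakeModel()
scores = []
for episode in range(1, 6):
    obs, info = env.reset()                        # unpack instead of obs = env.reset()
    done, score = False, 0.0
    while not done:
        action, _ = model.predict(obs)
        obs, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated             # end the episode on either signal
        score += reward
    scores.append(score)
    print('Episode:{} Score:{}'.format(episode, score))
```

With the real environment, the same two changes would apply: unpack the tuple from `reset()`, and stop on `terminated or truncated` so that an episode that hits the time limit also ends.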
    Reinforcement Learning Platform for UAVs
    I'm doing a project that aims to use reinforcement learning (PPO variations) with UAVs. What are the most up-to-date tools for implementing and trying new RL algorithms in this space? I've looked at AirSim, and it seems to no longer be supported by Microsoft. I've also been heavily looking at Flightmare, which is almost exactly what I want, but getting a tool that hasn't been maintained for years up and running is giving me headaches (and the documentation is not great or up to date either). Ultimately, what I'm looking for is: * Physics simulation * Photo-realistic vision * Built-in integration with Gym would be awesome * Python platform preferred, C++ also ok I've also used ROS/Gazebo with PyTorch previously, and that is my backup plan I suppose, but it's not photo-realistic and is kind of slow in my experience. submitted by /u/zeus_the_transistor [link] [comments]  ( 9 min )
    Training a RL Model with Continuous State & Action Space in a Real-World Scenario
    Hello everyone, I'm a Data Science student diving into an exciting thesis topic: using reinforcement learning to stabilize boats in rough seas by adjusting a keel's angle. But I am a bit concerned about the high complexity of the problem and the given situation: Action Space: Continuous, representing the keel's angle adjustments. State Space: Continuous, capturing the dynamic behavior of the sea, including waves. Training Environment: Currently, the company only has a real-world water tank setup to simulate the sea conditions. There's no computer simulation available. Given this setup, I have a couple of concerns: Is it possible to train an RL model effectively in such a complex real-world scenario without first having a computer simulation? And if yes, what would be your initial steps in doing so? Are there possibilities to reduce the problem's complexity while training exclusively in the real-world water tank simulation? (i.e. transforming the action space into a discrete action space?) Any insights or advice would be greatly appreciated! submitted by /u/No-Wasabi3556 [link] [comments]  ( 9 min )
    Supercharging reinforcement learning with logic
    Deep reinforcement learning has led to a variety of compelling results. However, performance issues, particularly relating to the data efficiency of simulation, have limited its applicability in domains where simulations run more slowly. Our solution is to use a logic-based framework, PyReason, as a proxy for the simulation. https://preview.redd.it/6wmg0qnlaaub1.png?width=1786&format=png&auto=webp&s=01f82cf24de79b317b6f9406b0b6379b949a34d3 We showed that inference with a PyReason logic program can provide up to a three order-of-magnitude speedup when compared with native simulations (we studied AFSIM and Starcraft2) while providing comparable reward and win rate (we found that PyReason-trained agents actually performed better than expected in both AFSIM and Starcraft2). https://preview.redd.it/u8f44fskaaub1.png?width=1636&format=png&auto=webp&s=9509f03a936f41cd0131388564833b86a39c295a However, the benefits of our semantic proxy go well beyond performance. The use of temporal logic programming has two crucial by-products: symbolic explainability and modularity. PyReason provides an explainable symbolic trace that captures the evolution of the environment in a precise manner, while modularity allows us to add or remove aspects of the logic program, allowing for adjustments to the simulation based on a library of behaviors. PyReason is well-suited to modeling simulated environments for other reasons too, namely its ability to directly capture non-Markovian relationships and its open-world nature (defaults are “uncertain” instead of true or false). We have demonstrated that agents can be trained with standard RL techniques such as DQN using this framework.
    Preprint: https://arxiv.org/abs/2310.06835 Video: https://youtu.be/9e6ZHJEJzgw Code for PyReason-as-a-Sim (integration with DQN): https://github.com/lab-v2/pyreason-rl-sim Code for PyReason Gym: https://github.com/lab-v2/pyreason-gym PyReason Home: neurosymbolic.asu.edu/pyreason/ submitted by /u/Neurosymbolic [link] [comments]  ( 9 min )
    Actor-critic on piecewise constant reward function
    I made an environment with a piecewise constant reward function for testing the network architecture. Its episode length is 1. The critic will try to learn this reward and become a piecewise constant function itself, so its gradient will be close to 0, which makes the gradient vanish for the policy. I can think of one solution: - Change the reward function to a dense reward. But I wanted some other views; has anyone solved such problems? submitted by /u/Automatic-Web8429 [link] [comments]
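To make the vanishing-gradient concern concrete: if the critic fits a piecewise constant reward exactly, its derivative with respect to the action is zero almost everywhere, so a policy gradient that flows through the critic receives no signal. Below is a small numerical sketch with a hypothetical one-dimensional reward; `step_reward` and the smoothed `dense_reward` are made up for illustration and are not the poster's environment.

```python
import math

def step_reward(a):
    """Hypothetical piecewise constant reward over a 1-D action."""
    return 1.0 if a >= 0.5 else 0.0

def dense_reward(a, sharpness=10.0):
    """Smoothed (dense) alternative: a sigmoid centred on the same threshold."""
    return 1.0 / (1.0 + math.exp(-sharpness * (a - 0.5)))

def finite_diff(f, a, eps=1e-4):
    """Central finite-difference estimate of df/da."""
    return (f(a + eps) - f(a - eps)) / (2 * eps)

actions = [0.1, 0.3, 0.7, 0.9]
step_grads = [finite_diff(step_reward, a) for a in actions]
dense_grads = [finite_diff(dense_reward, a) for a in actions]
print(step_grads)   # zero away from the jump at 0.5
print(dense_grads)  # positive everywhere, pointing toward higher reward
```

Methods that do not differentiate through the critic, such as score-function (likelihood-ratio) policy gradients, do not rely on this derivative, which may be another angle to consider besides reward shaping.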
    Help understanding the PETS algorithm
    I am trying to read this paper and I am unable to get the big picture. Can someone please explain what's going on in the Propagation and Planning stages? In the Model stage, I understand that they are using a probabilistic model to handle uncertainty. https://preview.redd.it/idenqd492aub1.png?width=945&format=png&auto=webp&s=40da9bf53b21dbed63b70571f3833b0fe3a9dabb For instance, what does Particle mean in this paper? The big picture here is that I am trying to understand the Model Based Policy Optimization paper, and it seemed like they built upon the above paper. submitted by /u/Academic-Rent7800 [link] [comments]
  • Open

    Supercharging reinforcement learning with logic
    Deep reinforcement learning has led to a variety of compelling results. However, performance issues, particularly relating to the data efficiency of simulation, have limited its applicability in domains where simulations run more slowly. Our solution is to use a logic-based framework, PyReason, as a proxy for the simulation. https://preview.redd.it/pmukb2k7aaub1.png?width=1786&format=png&auto=webp&s=3fb36d0fbeb75393ae8f71f8f369ff5e0b79fbcb We showed that inference with a PyReason logic program can provide up to a three order-of-magnitude speedup when compared with native simulations (we studied AFSIM and Starcraft2) while providing comparable reward and win rate (we found that PyReason-trained agents actually performed better than expected in both AFSIM and Starcraft2). https://preview.…

  • Open

    [D] Detect anomaly with small dataset
    Hi guys, I'm hoping for advice on how to detect patterns/anomalies at small scale. I understand there are tools out there for webpage monitoring, but as an example, say I'm ingesting a small amount of hourly/daily traffic to a sub-page on my site (anywhere from 50-100 visits per day, which may mean at most ~30 visits per hour). There are times when traffic to the page drops because the page doesn't fully load, or because the other page on which I'm hosting the link to this page doesn't load (so people can't see the link to this sub-page). Given the scope and scale of this and the amount of data, it's not possible for me to use other solutions for anomaly detection (those that cost $100-$1000+/month), and I'm not sure where to start with ML given this minimal amount of hourly/daily data to monitor. Is there anything that I should look into? Thank you submitted by /u/duyth [link] [comments]  ( 9 min )
    [D] Foundational must reads for LLMs
    Came across this post https://community.openai.com/t/foundational-must-read-gpt-llm-papers/197003 As I am new to LLMs, please share your thoughts on how to start and which subtopics to learn in depth. submitted by /u/Electrical_Study_617 [link] [comments]  ( 8 min )
    [D] Google AutoML Alternatives?
    Having jumped into AI this last year, I've used Google AutoML a lot and it's honestly worked great. I primarily use it for text classification. Training usually takes anywhere from 4-8 hours. The results have been above 90% accurate at inference. Now, the problem: cost. It's super expensive to run an endpoint for predictions with Google AutoML for text classification. I'm wondering if anyone has any alternatives or ideas for similar results for cheaper. I am ok waiting a bit for prediction results, as I don't need sub-1ms responses lol. But everything I've tried has yielded less than optimal results. I tried various Hugging Face models, and accuracy is about 50%. submitted by /u/zepaz [link] [comments]  ( 9 min )
    [R] Octopus: Embodied Vision-Language Programmer from Environmental Feedback - Nanyang Technological University 2023 - Continually refines its understanding and execution, demonstrating impressive adaptability!
    Paper: https://arxiv.org/abs/2310.08588 Blog: https://choiszt.github.io/Octopus/ Github: https://github.com/dongyh20/Octopus Youtube short: https://www.youtube.com/watch?v=lHbTvB0yIP4 Abstract: Large vision-language models (VLMs) have achieved substantial progress in multimodal perception and reasoning. Furthermore, when seamlessly integrated into an embodied agent, it signifies a crucial stride towards the creation of autonomous and context-aware systems capable of formulating plans and executing commands with precision. In this paper, we introduce Octopus, a novel VLM designed to proficiently decipher an agent's vision and textual task objectives and to formulate intricate action sequences and generate executable code. Our design allows the agent to adeptly handle a wide spectrum of tasks, ranging from mundane daily chores in simulators to sophisticated interactions in complex video games. Octopus is trained by leveraging GPT-4 to control an explorative agent to generate training data, i.e., action blueprints and the corresponding executable code, within our experimental environment called OctoVerse. We also collect the feedback that allows the enhanced training scheme of Reinforcement Learning with Environmental Feedback (RLEF). Through a series of experiments, we illuminate Octopus's functionality and present compelling results, and the proposed RLEF turns out to refine the agent's decision-making. By open-sourcing our model architecture, simulator, and dataset, we aspire to ignite further innovation and foster collaborative applications within the broader embodied AI community. 
    submitted by /u/Singularian2501 [link] [comments]  ( 9 min )
    [R] My article about autonomous LLMs-based agents: Chain of Thought, Plan and Solve, Self-Ask, ReAct, Reflexion, Self-Consistency, ToT, and GoT; and intrinsic insights behind autonomous LLMs-based agents.
    A Complete Guide to LLMs-based Autonomous Agents (Part I): https://medium.com/p/69515c016792 My article offers a comprehensive overview of LLM-based agents, covering Chain of Thought, Plan and Solve/Execute, Self-Ask, ReAct, Reflexion, Self-Consistency, Tree of Thoughts, and Graph of Thoughts. It traces their evolution from basic forms, driven primarily by prompt engineering, to advanced models that emulate human problem-solving intricacies. Moreover, it provides an engineer's insights into the architecture behind these autonomous agents. Naturally suitable for AI agent: LLMs feature a natural language interface tailored for user-computer interactions and they come equipped with innate reasoning abilities. LLM's Deficiency: Despite its strengths, GPT-4 can provide incorrect answers or hallucinations for complex tasks. Challenges with Training: Finetuning pretrained LLMs doesn't enhance reasoning capabilities. While creating a larger LLM can bolster its problem-solving skills, the process can span several months to a year, potentially leading to a two-year wait before its official launch. Closed Model and RAG: LLMs, once trained, are unable to fetch real-time data and have inherent shortcomings. However, for Q&A tasks, leveraging an open-book method proves more effective. The aim is not to have an all-knowing model but one skilled in reasoning and utilizing tools. LLM Agent Approach: We direct LLMs to break down intricate tasks, tackle individual sub-tasks, evaluate them, and make revisions of the strategy as needed. submitted by /u/Appropriate-Map-9923 [link] [comments]  ( 9 min )
    [D] Have a research paper to do for my masters in Big Data Analytics. Wanted to do something with ML. Just looking for some advice.
    I'm in my last semester and we have to pick a topic related to big data analytics. Right now I have to prepare a proposal for my topic. My topic will have to do with ML and the medical field. Current plan: Get a dataset related to my topic. Right now it's Parkinson's disease. My question is, would I need a dataset with text data, or would images of brain scans be better for, say, early detection? I can't figure out which would be the better dataset. Get the dataset and then use Azure Machine Learning to prepare it, do some data cleaning and handling, and then get a model out of it. I picked Azure because I have an Azure license from my uni and, after searching about, I read about the Azure Machine Learning service. Would Azure be a good choice for training my model on this task? I've mostly used Google Colab for training small models. Once the model is trained and set up, I want to set up a front-end web app (Flask) and hook up my model so that users can upload either text data or image scans and the model outputs results for the inputted data. My question is, would it be ideal to have the model located on my local machine, or would Azure let me do API calls from my local machine to the Azure-trained model? Would all this be feasible to do? I'm not looking to develop a full-fledged application, just want to create a model with a dataset of images or text and then be able to feed new images to the trained model and get an output. Just looking for opinions or advice on this topic. Thanks. submitted by /u/Jesustakethewheeeeel [link] [comments]  ( 9 min )
    [D] Is the topic of your ML PhD important?
    I read the previous discussion on whether a PhD is required in the field, and I had a follow-up question: does the topic of your PhD matter? So let’s say you finish a PhD in the field of medical machine learning (non-CV), would an automotive company, FAANG, or e.g. DeepMind still like to hire you once you would like to switch your sub-field a bit? Or are you simply less desirable than a candidate without a PhD but with more experience in CV? I am asking this because I would like to stay flexible, as there are many ML sub-fields I want to work in, and I do not want to limit my options by pursuing a PhD in a topic that I don’t want to work in for my entire life. For context, I already have 2 years of working experience as an AI engineer and I am finishing my AI master’s. submitted by /u/Otoz123 [link] [comments]  ( 9 min )
    [D] Fine-Tuning tortoise tts
    I'm planning on creating my own AI voice to use with ChatGPT. I have done my research, and there are two ways to achieve a quality TTS model to use. I have tried them both. I fine-tuned tortoise tts on my own 20-minute dataset. I have also tried to create a model using Tacotron2 and the same dataset. The quality of the fine-tuned model is better. But one downside is that I still have to give the fine-tuned tortoise model a reference voice for it to choose the voice that I fine-tuned it with. On the other hand, the Tacotron2 model trained from scratch didn't need one. The question here is: why didn't the tortoise model choose the voice in the dataset as the default? Do I need to expand my dataset for it to be chosen as the main voice? Thanks, all. submitted by /u/Capital_Birthday_654 [link] [comments]  ( 9 min )
    [R] Do pretrained Transformers Really Learn In-context by Gradient Descent?
    https://x.com/Shadowkiller331/status/1713003711629516862?s=20 submitted by /u/Educational-Newt2052 [link] [comments]  ( 8 min )
    [R] Unlocking the power of Sparsity in Generative Models: 8x Faster LLMs on CPUs with Sparse Fine Tuning
    submitted by /u/markurtz [link] [comments]  ( 8 min )
    A[R]xiv [D]ives - Llama 2 Deep Dive
    We’ve been diving deep into foundational papers on Fridays as a group. It’s been helpful for us to get into the nitty gritty details of these papers, so hope you find it helpful too. Would love to have anyone join the discussion next week! submitted by /u/FallMindless3563 [link] [comments]  ( 9 min )
    [N] Most detailed human brain map ever contains 3,300 cell types
    What can this mean for artificial neural networks? submitted by /u/hhh888hhhh [link] [comments]  ( 8 min )
    [D] My fine tune behaves like the base model
    Hi all, I did a fine tune of CodeLlama-7b on a custom dataset and I was getting very excited because it was doing very well on evals. I saved the model with model.save() and model.push_to_hub() and it seemed to work. When I load the model it shows the structure with the Lora_A and Lora_B for every layer, but it now acts like the base model with no changes. Is it possible I saved wrong or likely that I am loading wrong? Any help is greatly appreciated! submitted by /u/cstein123 [link] [comments]  ( 9 min )
    [R] Machine Learning Courses Mega Bundle from Mammoth Interactive
    submitted by /u/brand_momentum [link] [comments]  ( 8 min )
    [R] best RL algorithm for a single turn game ?
    Hi there, I'm new to Reinforcement Learning (RL), and the papers I've come across mainly focus on scenarios where states change with choices in a game. However, I'm interested in finding the best RL algorithm for a simpler case. I have an input I and a policy P. P outputs probabilities for available choices (a limited set of integers), and a reward r is given for each choice (the reward is costly to compute that’s why I use RL). The goal is to train P to maximize the reward. So as if we are in a game that ends after only one choice. Any recommendations for the best RL algorithm in this case? Thanks! submitted by /u/Meddhouib10 [link] [comments]  ( 9 min )
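The one-choice setting described in the post above is often called a bandit problem. As a minimal sketch only (not a recommendation taken from the post), the following trains a softmax policy over a few choices with a REINFORCE-style update and a running baseline. The reward table `toy_reward`, the learning rate, and the step count are all illustrative assumptions, and the input `I` is omitted for brevity, so the policy here is context-free.

```python
import math
import random


def softmax(logits):
    """Convert logits to probabilities in a numerically stable way."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_bandit(reward_fn, n_choices, steps=2000, lr=0.1, seed=0):
    """REINFORCE for a single-step game: sample a choice, observe reward, update logits."""
    rng = random.Random(seed)
    logits = [0.0] * n_choices
    baseline = 0.0  # running average of rewards, used to reduce variance
    for _ in range(steps):
        probs = softmax(logits)
        choice = rng.choices(range(n_choices), weights=probs)[0]
        reward = reward_fn(choice)  # the expensive call in the post's setting
        advantage = reward - baseline
        baseline += 0.05 * (reward - baseline)
        # Gradient of log pi(choice) w.r.t. logit k is 1[k == choice] - probs[k].
        for k in range(n_choices):
            grad = (1.0 if k == choice else 0.0) - probs[k]
            logits[k] += lr * advantage * grad
    return softmax(logits)


def toy_reward(choice):
    """Hypothetical reward table standing in for the costly reward computation."""
    return [0.1, 0.9, 0.3][choice]


final_probs = train_bandit(toy_reward, n_choices=3)
print(final_probs)
```

Each training step costs exactly one reward evaluation, which matters when the reward is expensive; conditioning the logits on an input would turn this into a contextual bandit.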
    [D] Ways to get research experience before grad school
    I recently graduated with my bachelor's from a low ranked school with a good GPA. I was planning on starting a PhD studying NLP and applied to 12 mid level schools. However, I was unfortunately rejected from all the schools I applied to. I suspect it was likely due to my lack of experience in NLP research as my school didn't have any professors who do research in that area. My current plan is to work in industry for the next two years and try and do some NLP research on the side before reapplying. Do any NLP labs allow for external volunteer researchers? Besides that, are there any other ways to get research experience? submitted by /u/Bananas970 [link] [comments]  ( 9 min )
    [D] Validation loss is decreasing but WER is increasing in Whisper model training.
    Hi, I've been using the Hugging Face library to fine-tune the Whisper model. While the WER was initially decreasing, I've noticed it began to rise even though the validation loss continues to drop. Could the issue be related to my testing on a very small dataset? As shown in the image, after the 80th step the WER suddenly started increasing, from 13 to 28. https://preview.redd.it/xq2bm0oyh5ub1.png?width=838&format=png&auto=webp&s=136447f527bea6880b46ae588463500304b1d6bb submitted by /u/aadityaura [link] [comments]  ( 9 min )
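For reference, the WER mentioned in the post above is conventionally the word-level edit distance between reference and hypothesis, divided by the number of reference words. A minimal plain-Python sketch follows; the function name `wer` and the whitespace tokenization are assumptions for illustration, not the implementation used by any particular library.

```python
def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if r == h else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


print(wer("the cat sat on the mat", "the cat sit on mat"))
```

WER is computed on decoded text, while validation loss is averaged per token, so the two metrics need not move together; on a very small evaluation set a handful of bad transcripts can also shift WER substantially.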
    Looking for An Easy-To-Use API To Train Image Model [D]
    Yo! I have some images I curated on MJ, and I want to run these together into an AI and spit out more outputs like these. The current process gets me maybe .2% successful outputs through MJ. I figure the next step to more outputs is training a custom model. What's the easiest way to do this using a web-based API? Does this involve using Stable Diffusion? submitted by /u/AdministrativePie991 [link] [comments]  ( 9 min )
    [D] Time Series Forecasting on positive AND negative Examples
    Hey 😀 not sure if extremely trivial or really tricky. In the end, I want a machine that generates a time series without further input based on training data, generating a new time series every time. I want this to be based on a transformer. I want it trained with data looking like this: 2023-07-03 14:19:48,GOOD 2023-07-04 13:59:07,GOOD 2023-07-05 01:58:54,GOOD 2023-07-05 03:30:05,BAD 2023-07-05 05:17:43,BAD 2023-07-06 05:35:34,GOOD 2023-07-07 14:06:03,GOOD 2023-07-08 21:16:05,BAD with “GOOD” and “BAD” being the state of the system which is likely dependent on the time series data up to that point. I have a lot of data and it’s data points like the one above with maybe a hundred rows of data on average for a few thousand systems. Every system is independent of all others but all are identical. I do not want to train only on “GOOD” as this would leave out a lot of valuable data … Is there a way to train a time series transformer with both data that leads to GOOD as well as BAD outcomes, so it would generate time series from scratch that are unlikely to have BAD outcomes? Thank you!! submitted by /u/_VeniVidiVeni_ [link] [comments]  ( 9 min )
    [P] VGSLify: Transform Your TensorFlow Model Prototyping Experience
    Hey r/MachineLearning! 🚀 Have you ever been frustrated with the lengthy and sometimes cumbersome TensorFlow code for defining models? Or wished you could experiment with different architectures without dealing with copious lines of code? That's where VGSLify steps in. Why Use VGSLify? Compact Definitions: VGSLify leverages VGSL spec, enabling you to express intricate model architectures in a compact and elegant manner. This means you can quickly experiment with different models by simply tweaking a string format, bypassing the verbose code traditionally required. Swift Prototyping: Craft intricate neural network architectures using succinct VGSL spec strings, allowing you to iterate faster and more efficiently. From TensorFlow to VGSL: Got a pre-existing TensorFlow model? Easily co…  ( 9 min )
    [D] How important is a PhD for industry?
    I'm 21 years old and currently pursuing a master's degree in theoretical physics in the UK. I have a strong interest in machine learning and have completed many computing courses as well as independent projects in this field. I'm considering a career in machine learning and I'm curious about the benefits of doing a PhD. I've heard that the salary difference may not be substantial. Could anyone provide insights on how important a PhD is for specific roles in this field? Additionally, what factors should I consider when deciding whether to pursue a PhD in machine learning, apart from my passion for ML? Also, are private PhDs common in ML, i.e. working at a company and asking them to let you pursue a PhD within the company? Thanks :) submitted by /u/Neat-Print2792 [link] [comments]  ( 9 min )
    [D] SHAP mask_token: why does it matter and which one to choose?
    submitted by /u/Being-Nothingness [link] [comments]  ( 9 min )
    [R] Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution
    submitted by /u/hardmaru [link] [comments]  ( 8 min )
    [D] is there a good Code or Text model ?
    I am trying to detect code segments in an LLM's text response so I can highlight them using highlight.js. Is there a good model that can classify a block of text and decide whether it is a block of code or a block of plain natural-language text (English)? submitted by /u/Particular_Flower_12 [link] [comments]  ( 9 min )
  • Open

    Looking for developers / future founders who want to build and grow disruptive AI apps.
    I am a multi-time founder myself. I've secured millions from investors for my past startups and had notable success with a video app that gathered 4M users and $300k in revenue. However, due to the intense competition in the video app editing sector, my team and I couldn't turn a profit. After my last startup faltered during the covid period, I transitioned to being a full-time product-market fit and growth marketing consultant and have made really great money doing it. I assist new startups in avoiding the mistakes I made and implement frameworks that significantly increase their chances of success. I've observed that many new founders venture into startups without fully grasping the challenges of building something people genuinely desire. It’s really not easy. How would you know what y…
    AI Images Detectors Are Being Used to Discredit the Real Horrors of War
    A free AI image detector is being used to discredit a photograph of a burnt corpse of a baby killed in Hamas's attack on Israel. However, experts have pointed out that the image does not show any signs of being created by AI. The idea that the image is AI-generated has spread on Twitter, suggesting that official Israeli accounts are spreading AI-generated misinformation. AI image generators have trouble replicating reality accurately, and the shadows in the photograph are consistent with a real image. Multiple AI image detection tools have also determined that the image is not AI-generated. Source : https://www.404media.co/ai-images-detectors-are-being-used-to-discredit-the-real-horrors-of-war/ submitted by /u/NuseAI [link] [comments]
    Seeking a Community for Open-Source AI Code Generation Models
    Hello everyone! 🌟 I hope this post finds you well. I've been delving deeper into the world of AI code generation recently and am curious to discover if there are communities or platforms specifically dedicated to open-source AI code generation models. I'm aware of Hugging Face, but are there any others besides that? Here's what I'm looking for: Collaboration: A space where enthusiasts and experts alike can collaborate on projects, share insights, and improve upon existing models. Discussion: Forums or chat platforms where discussions around the challenges, breakthroughs, and best practices in AI code generation take place. Resource Sharing: A repository or platform where open-source models, datasets, and related tools can be freely shared and accessed. Learning and Tutorials: Any resources that can help newcomers grasp the concepts and intricacies of AI code generation. If you know of any such community or are part of one, please do let me know. submitted by /u/akanshtyagi [link] [comments]
    Mickey, what are you doing?
    submitted by /u/LeviJr00 [link] [comments]
    Updates to my Capstone Project with Enhanced Features and still freely available to all (until OpenAI credits deplete - Free ChatGPT4). Hoping to introduce the community feature too where people can generate STEM animations to aid learning
    submitted by /u/Raymondlkj [link] [comments]
    learning for school with an AI
    Does anyone know if there is an AI online where you can import documents and the AI forms and asks you questions about the topic in the document? Like an AI that generates tests for you, so you're prepared for every potential question in a school test. submitted by /u/satanskittenz [link] [comments]
    Creative Question: Your ideas for AI generative reality
    Ok so we have AI-generated content: first text, then images, then videos. What will the world look like when we have a generative world? Generative objects, generative games, generative moods, generative memories, generative senses and perceptions, generative environments, generative reality. Anyone want to talk about what it might look like? (I would like to hear an unhinged idea for what might happen, speculative of course.) submitted by /u/rolyataylor2 [link] [comments]
  • Open

    AI’s Kryptonite: Data Quality
    The ability of Generative AI (GenAI) tools to deliver accurate and reliable outputs entirely depends on the accuracy and reliability of the data used to train the Large Language Models (LLMs) that power the GenAI tool. Unfortunately, the Law of GIGO – Garbage In, Garbage Out – threatens the widespread adoption of GenAI. Whether generating… The post AI’s Kryptonite: Data Quality appeared first on Data Science Central.  ( 22 min )
  • Open

    "Pitfalls of learning a reward function online", Armstrong et al 2020 {DM}
    submitted by /u/gwern [link] [comments]
  • Open

    Newton line
    Let Q be a convex quadrilateral with at most two parallel sides. Draw the two diagonals then draw a line through their midpoints. This line is called the Newton line. (The requirement that at most two sides are parallel ensures that the midpoints are distinct and so there is a unique line joining them.) In […] Newton line first appeared on John D. Cook.  ( 5 min )
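The construction above can be sketched in a few lines: take the midpoints of the two diagonals and, if they are distinct, they determine the Newton line. The function names and the sample quadrilateral below are illustrative assumptions; vertices are assumed to be listed in order around the quadrilateral, so the diagonals are AC and BD.

```python
def midpoint(p, q):
    """Midpoint of the segment from p to q."""
    return ((p[0] + q[0]) / 2, (p[1] + q[1]) / 2)


def newton_line(a, b, c, d):
    """Return two points on the Newton line of quadrilateral ABCD.

    The points are the midpoints of diagonals AC and BD. If they coincide,
    the diagonals bisect each other (a parallelogram) and no unique line exists.
    """
    m1 = midpoint(a, c)
    m2 = midpoint(b, d)
    if m1 == m2:
        raise ValueError("diagonal midpoints coincide; the Newton line is not defined")
    return m1, m2


# A sample convex quadrilateral with no parallel sides.
m1, m2 = newton_line((0, 0), (4, 0), (3, 2), (0, 3))
print(m1, m2)
```

For a parallelogram such as (0, 0), (2, 0), (3, 1), (1, 1), both diagonal midpoints fall on the same point, which is exactly the degenerate case the parallel-sides condition rules out.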
  • Open

    FABind: Fast and Accurate Protein-Ligand Binding. (arXiv:2310.06763v2 [cs.LG] UPDATED)
    Modeling the interaction between proteins and ligands and accurately predicting their binding structures is a critical yet challenging task in drug discovery. Recent advancements in deep learning have shown promise in addressing this challenge, with sampling-based and regression-based methods emerging as two prominent approaches. However, these methods have notable limitations. Sampling-based methods often suffer from low efficiency due to the need for generating multiple candidate structures for selection. On the other hand, regression-based methods offer fast predictions but may experience decreased accuracy. Additionally, the variation in protein sizes often requires external modules for selecting suitable binding pockets, further impacting efficiency. In this work, we propose $\mathbf{FABind}$, an end-to-end model that combines pocket prediction and docking to achieve accurate and fast protein-ligand binding. $\mathbf{FABind}$ incorporates a unique ligand-informed pocket prediction module, which is also leveraged for docking pose estimation. The model further enhances the docking process by incrementally integrating the predicted pocket to optimize protein-ligand binding, reducing discrepancies between training and inference. Through extensive experiments on benchmark datasets, our proposed $\mathbf{FABind}$ demonstrates strong advantages in terms of effectiveness and efficiency compared to existing methods. Our code is available at $\href{https://github.com/QizhiPei/FABind}{Github}$.  ( 2 min )
    On Extreme Value Asymptotics of Projected Sample Covariances in High Dimensions with Applications in Finance and Convolutional Networks. (arXiv:2310.08150v1 [math.ST])
    Maximum-type statistics of certain functions of the sample covariance matrix of high-dimensional vector time series are studied to statistically confirm or reject the null hypothesis that a data set has been collected under normal conditions. The approach generalizes the case of the maximal deviation of the sample autocovariance function from its assumed values. Within a linear time series framework it is shown that Gumbel-type extreme value asymptotics holds true. As applications we discuss long-only minimal-variance portfolio optimization and subportfolio analysis with respect to idiosyncratic risks, ETF index tracking by sparse tracking portfolios, convolutional deep learners for image analysis and the analysis of array-of-sensors data.  ( 2 min )
    GRASP: Accelerating Shortest Path Attacks via Graph Attention. (arXiv:2310.07980v1 [cs.LG])
    Recent advances in machine learning (ML) have shown promise in aiding and accelerating classical combinatorial optimization algorithms. ML-based speed ups that aim to learn in an end to end manner (i.e., directly output the solution) tend to trade off run time with solution quality. Therefore, solutions that are able to accelerate existing solvers while maintaining their performance guarantees, are of great interest. We consider an APX-hard problem, where an adversary aims to attack shortest paths in a graph by removing the minimum number of edges. We propose the GRASP algorithm: Graph Attention Accelerated Shortest Path Attack, an ML aided optimization algorithm that achieves run times up to 10x faster, while maintaining the quality of solution generated. GRASP uses a graph attention network to identify a smaller subgraph containing the combinatorial solution, thus effectively reducing the input problem size. Additionally, we demonstrate how careful representation of the input graph, including node features that correlate well with the optimization task, can highlight important structure in the optimization solution.  ( 2 min )
    GenTKG: Generative Forecasting on Temporal Knowledge Graph. (arXiv:2310.07793v1 [cs.CL])
    The rapid advancements in large language models (LLMs) have ignited interest in the temporal knowledge graph (tKG) domain, where conventional carefully designed embedding-based and rule-based models dominate. The question remains open of whether pre-trained LLMs can understand structured temporal relational data and replace them as the foundation model for temporal relational forecasting. Therefore, we bring temporal knowledge forecasting into the generative setting. However, challenges occur in the huge chasms between complex temporal graph data structure and sequential natural expressions LLMs can handle, and between the enormous data sizes of tKGs and heavy computation costs of finetuning LLMs. To address these challenges, we propose a novel retrieval augmented generation framework that performs generative forecasting on tKGs named GenTKG, which combines a temporal logical rule-based retrieval strategy and lightweight parameter-efficient instruction tuning. Extensive experiments have shown that GenTKG outperforms conventional methods of temporal relational forecasting under low computation resources. GenTKG also highlights remarkable transferability with exceeding performance on unseen datasets without re-training. Our work reveals the huge potential of LLMs in the tKG domain and opens a new frontier for generative forecasting on tKGs.  ( 2 min )
    Neural Combinatorial Optimization with Heavy Decoder: Toward Large Scale Generalization. (arXiv:2310.07985v1 [cs.LG])
    Neural combinatorial optimization (NCO) is a promising learning-based approach for solving challenging combinatorial optimization problems without specialized algorithm design by experts. However, most constructive NCO methods cannot solve problems with large-scale instance sizes, which significantly diminishes their usefulness for real-world applications. In this work, we propose a novel Light Encoder and Heavy Decoder (LEHD) model with a strong generalization ability to address this critical issue. The LEHD model can learn to dynamically capture the relationships between all available nodes of varying sizes, which is beneficial for model generalization to problems of various scales. Moreover, we develop a data-efficient training scheme and a flexible solution construction mechanism for the proposed LEHD model. By training on small-scale problem instances, the LEHD model can generate nearly optimal solutions for the Travelling Salesman Problem (TSP) and the Capacitated Vehicle Routing Problem (CVRP) with up to 1000 nodes, and also generalizes well to solve real-world TSPLib and CVRPLib problems. These results confirm our proposed LEHD model can significantly improve the state-of-the-art performance for constructive NCO. The code is available at https://github.com/CIAM-Group/NCO_code/tree/main/single_objective/LEHD.  ( 2 min )
    Variational operator learning: A unified paradigm marrying training neural operators and solving partial differential equations. (arXiv:2304.04234v2 [cs.LG] UPDATED)
    Neural operators as novel neural architectures for fast approximating solution operators of partial differential equations (PDEs), have shown considerable promise for future scientific computing. However, the mainstream of training neural operators is still data-driven, which needs an expensive ground-truth dataset from various sources (e.g., solving PDEs' samples with the conventional solvers, real-world experiments) in addition to training stage costs. From a computational perspective, marrying operator learning and specific domain knowledge to solve PDEs is an essential step in reducing dataset costs and label-free learning. We propose a novel paradigm that provides a unified framework of training neural operators and solving PDEs with the variational form, which we refer to as the variational operator learning (VOL). Ritz and Galerkin approach with finite element discretization are developed for VOL to achieve matrix-free approximation of system functional and residual, then direct minimization and iterative update are proposed as two optimization strategies for VOL. Various types of experiments based on reasonable benchmarks about variable heat source, Darcy flow, and variable stiffness elasticity are conducted to demonstrate the effectiveness of VOL. With a label-free training set and a 5-label-only shift set, VOL learns solution operators with its test errors decreasing in a power law with respect to the amount of unlabeled data. To the best of the authors' knowledge, this is the first study that integrates the perspectives of the weak form and efficient iterative methods for solving sparse linear systems into the end-to-end operator learning task.  ( 3 min )
    Diffusion-based Generative AI for Exploring Transition States from 2D Molecular Graphs. (arXiv:2304.12233v3 [physics.chem-ph] UPDATED)
    The exploration of transition state (TS) geometries is crucial for elucidating chemical reaction mechanisms and modeling their kinetics. Recently, machine learning (ML) models have shown remarkable performance for prediction of TS geometries. However, they require 3D conformations of reactants and products, often with their appropriate orientations, as input, which demands substantial effort and computational cost. Here, we propose a generative approach based on the stochastic diffusion method, namely TSDiff, for prediction of TS geometries just from 2D molecular graphs. TSDiff outperformed the existing ML models with 3D geometries in terms of both accuracy and efficiency. Moreover, it enables sampling of various TS conformations, because it learned the distribution of TS geometries for diverse reactions in training. Thus, TSDiff was able to find more favorable reaction pathways with lower barrier heights than those in the reference database. These results demonstrate that TSDiff shows promising potential for an efficient and reliable TS exploration.  ( 2 min )
    LLMMaps -- A Visual Metaphor for Stratified Evaluation of Large Language Models. (arXiv:2304.00457v3 [cs.CL] UPDATED)
    Large Language Models (LLMs) have revolutionized natural language processing and demonstrated impressive capabilities in various tasks. Unfortunately, they are prone to hallucinations, where the model exposes incorrect or false information in its responses, which renders diligent evaluation approaches mandatory. While LLM performance in specific knowledge fields is often evaluated based on question and answer (Q&A) datasets, such evaluations usually report only a single accuracy number for the dataset, which often covers an entire field. This field-based evaluation is problematic with respect to transparency and model improvement. A stratified evaluation could instead reveal subfields where hallucinations are more likely to occur and thus help to better assess LLMs' risks and guide their further development. To support such stratified evaluations, we propose LLMMaps as a novel visualization technique that enables users to evaluate LLMs' performance with respect to Q&A datasets. LLMMaps provide detailed insights into LLMs' knowledge capabilities in different subfields, by transforming Q&A datasets as well as LLM responses into an internal knowledge structure. Furthermore, an extension for comparative visualization allows for the detailed comparison of multiple LLMs. To assess LLMMaps we use them to conduct a comparative analysis of several state-of-the-art LLMs, such as BLOOM, GPT-2, GPT-3, ChatGPT and LLaMa-13B, as well as two qualitative user evaluations. All necessary source code and data for generating LLMMaps to be used in scientific publications and elsewhere is available on GitHub: https://github.com/viscom-ulm/LLMMaps  ( 3 min )
    A general framework for multi-step ahead adaptive conformal heteroscedastic time series forecasting. (arXiv:2207.14219v9 [stat.ML] UPDATED)
    This paper introduces a novel model-agnostic algorithm called adaptive ensemble batch multi-input multi-output conformalized quantile regression (AEnbMIMOCQR) that enables forecasters to generate multi-step ahead prediction intervals for a fixed pre-specified miscoverage rate in a distribution-free manner. Our method is grounded in conformal prediction principles; however, it does not require data splitting and provides close to exact coverage even when the data is not exchangeable. Moreover, the resulting prediction intervals, besides being empirically valid along the forecast horizon, do not neglect heteroscedasticity. AEnbMIMOCQR is designed to be robust to distribution shifts, which means that its prediction intervals remain reliable over an unlimited period of time, without entailing retraining or imposing unrealistically strict assumptions on the data-generating process. Through methodical experimentation, we demonstrate that our approach outperforms other competitive methods on both real-world and synthetic datasets. The code used in the experimental part and a tutorial on how to use AEnbMIMOCQR can be found at the following GitHub repository: https://github.com/Quilograma/AEnbMIMOCQR.  ( 3 min )
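For readers unfamiliar with the building block named in the abstract, the following is a minimal sketch of plain split conformalized quantile regression (CQR), not of AEnbMIMOCQR itself. The abstract states that AEnbMIMOCQR avoids data splitting and adapts over time; none of that is reproduced here. The function names and the toy calibration numbers are hypothetical.

```python
import math

def cqr_offset(lower_preds, upper_preds, targets, alpha):
    """Conformal correction for quantile-regression intervals (split CQR).

    Score of each calibration point: how far the target falls outside
    [lower, upper] (negative when it lies inside the interval).
    """
    scores = sorted(
        max(lo - y, y - hi)
        for lo, hi, y in zip(lower_preds, upper_preds, targets)
    )
    n = len(scores)
    # Finite-sample rank: the ceil((n + 1) * (1 - alpha))-th smallest score.
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf  # too few calibration points for this alpha
    return scores[k - 1]

def conformalize(lower, upper, offset):
    """Widen (or shrink, if offset < 0) a predicted interval by the offset."""
    return lower - offset, upper + offset

# Toy calibration set (hypothetical numbers): intervals [0, 1] for all points.
offset = cqr_offset([0.0, 0.0, 0.0], [1.0, 1.0, 1.0], [0.5, 1.2, -0.4], alpha=0.5)
interval = conformalize(2.0, 3.0, offset)
```

The offset is a single scalar here; a multi-step method such as the one described would need one correction per forecast horizon, which this sketch leaves out.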
    Conditional Mutual Information for Disentangled Representations in Reinforcement Learning. (arXiv:2305.14133v2 [cs.LG] UPDATED)
    Reinforcement Learning (RL) environments can produce training data with spurious correlations between features due to the amount of training data or its limited feature coverage. This can lead to RL agents encoding these misleading correlations in their latent representation, preventing the agent from generalising if the correlation changes within the environment or when deployed in the real world. Disentangled representations can improve robustness, but existing disentanglement techniques that minimise mutual information between features require independent features, thus they cannot disentangle correlated features. We propose an auxiliary task for RL algorithms that learns a disentangled representation of high-dimensional observations with correlated features by minimising the conditional mutual information between features in the representation. We demonstrate experimentally, using continuous control tasks, that our approach improves generalisation under correlation shifts, as well as improving the training performance of RL algorithms in the presence of correlated features.  ( 2 min )
    Bengali Document Layout Analysis -- A YOLOV8 Based Ensembling Approach. (arXiv:2309.00848v2 [cs.CV] UPDATED)
    This paper focuses on enhancing Bengali Document Layout Analysis (DLA) using the YOLOv8 model and innovative post-processing techniques. We tackle challenges unique to the complex Bengali script by employing data augmentation for model robustness. After meticulous validation set evaluation, we fine-tune our approach on the complete dataset, leading to a two-stage prediction strategy for accurate element segmentation. Our ensemble model, combined with post-processing, outperforms individual base architectures, addressing issues identified in the BaDLAD dataset. By leveraging this approach, we aim to advance Bengali document analysis, contributing to improved OCR and document comprehension. BaDLAD serves as a foundational resource for this endeavor, aiding future research in the field. Furthermore, our experiments provided key insights for incorporating new strategies into the established solution.
    Flood and Echo: Algorithmic Alignment of GNNs with Distributed Computing. (arXiv:2310.06970v2 [cs.LG] UPDATED)
    Graph Neural Networks are a natural fit for learning algorithms. They can directly represent tasks through an abstract but versatile graph structure and handle inputs of different sizes. This opens up the possibility for scaling and extrapolation to larger graphs, one of the most important advantages of an algorithm. However, this raises two core questions: i) How can we enable nodes to gather the required information in a given graph ($\textit{information exchange}$), even if it is far away, and ii) How can we design an execution framework which enables this information exchange for extrapolation to larger graph sizes ($\textit{algorithmic alignment for extrapolation}$)? We propose a new execution framework that is inspired by the design principles of distributed algorithms: Flood and Echo Net. It propagates messages through the entire graph in a wave-like activation pattern, which naturally generalizes to larger instances. Through its sparse but parallel activations, it is provably more efficient in terms of message complexity. We study the proposed model and provide both empirical evidence and theoretical insights in terms of its expressiveness, efficiency, information exchange and ability to extrapolate.
    WiGenAI: The Symphony of Wireless and Generative AI via Diffusion Models. (arXiv:2310.07312v2 [cs.IT] UPDATED)
    Innovative foundation models, such as GPT-3 and stable diffusion models, have made a paradigm shift in the realm of artificial intelligence (AI) towards generative AI-based systems. In unison, from a data communication and networking perspective, AI and machine learning (AI/ML) algorithms are envisioned to be pervasively incorporated into the future generations of wireless communications systems, highlighting the need for novel AI-native solutions for the emergent communication scenarios. In this article, we outline the applications of generative AI in wireless communication systems to lay the foundations for research in this field. Diffusion-based generative models, as the new state-of-the-art paradigm of generative models, are introduced, and their applications in wireless communication systems are discussed. Two case studies are also presented to showcase how diffusion models can be exploited for the development of resilient AI-native communication systems. Specifically, we propose denoising diffusion probabilistic models (DDPMs) for a wireless communication scheme with non-ideal transceivers, where a 30% improvement is achieved in terms of bit error rate. As the second application, DDPMs are employed at the transmitter to shape the constellation symbols, highlighting a robust out-of-distribution performance. Finally, future directions and open issues for the development of generative AI-based wireless systems are discussed to promote future research endeavors towards wireless generative AI (WiGenAI).
    OWAdapt: An adaptive loss function for deep learning using OWA operators. (arXiv:2305.19443v2 [cs.LG] UPDATED)
    In this paper, we propose a fuzzy adaptive loss function for enhancing deep learning performance in classification tasks. Specifically, we redefine the cross-entropy loss to effectively address class-level noise conditions, including the challenging problem of class imbalance. Our approach introduces aggregation operators, leveraging the power of fuzzy logic to improve classification accuracy. The rationale behind our proposed method lies in the iterative up-weighting of class-level components within the loss function, focusing on those with larger errors. To achieve this, we employ the ordered weighted average (OWA) operator and combine it with an adaptive scheme for gradient-based learning. Through extensive experimentation, our method outperforms other commonly used loss functions, such as the standard cross-entropy or focal loss, across various binary and multiclass classification tasks. Furthermore, we explore the influence of hyperparameters associated with the OWA operators and present a default configuration that performs well across different experimental settings.  ( 2 min )
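The ordered weighted average (OWA) operator mentioned above can be illustrated with a short sketch. The operator itself is standard: the weights are applied to the values after sorting them in descending order. Applying it to per-class losses with linearly decreasing weights is an assumption made here for illustration only; the abstract does not specify the paper's weighting scheme or adaptive update, and the helper names are hypothetical.

```python
def owa(values, weights):
    """Ordered weighted average: weights[i] multiplies the i-th largest value."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    ordered = sorted(values, reverse=True)
    return sum(w * v for w, v in zip(weights, ordered))

def linear_decreasing_weights(n):
    """Hypothetical weight vector that emphasizes the largest values."""
    total = n * (n + 1) / 2
    return [(n - i) / total for i in range(n)]

# Toy per-class losses (hypothetical numbers).
class_losses = [0.2, 1.5, 0.7]
owa_loss = owa(class_losses, linear_decreasing_weights(len(class_losses)))
mean_loss = sum(class_losses) / len(class_losses)
```

With uniform weights the operator reduces to the plain mean, and with all weight on the first position it reduces to the maximum, so the weight vector interpolates between averaging over classes and focusing on the worst class.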
    ImageNomer: description of a functional connectivity and omics analysis tool and case study identifying a race confound. (arXiv:2302.00767v2 [q-bio.PE] UPDATED)
    Most packages for the analysis of fMRI-based functional connectivity (FC) and genomic data are used with a programming language interface, lacking an easy-to-navigate GUI frontend. This exacerbates two problems found in these types of data: demographic confounds and quality control in the face of high dimensionality of features. The reason is that it is too slow and cumbersome to use a programming interface to create all the necessary visualizations required to identify all correlations, confounding effects, or quality control problems in a dataset. To remedy this situation, we have developed ImageNomer, a data visualization and analysis tool that allows inspection of both subject-level and cohort-level demographic, genomic, and imaging features. The software is Python-based, runs in a self-contained Docker image, and contains a browser-based GUI frontend. We demonstrate the usefulness of ImageNomer by identifying an unexpected race confound when predicting achievement scores in the Philadelphia Neurodevelopmental Cohort (PNC) dataset. In the past, many studies have attempted to use FC to identify achievement-related features in fMRI. Using ImageNomer, we find a clear potential for confounding effects of race. Using correlation analysis in the ImageNomer software, we show that FCs correlated with Wide Range Achievement Test (WRAT) score are in fact more highly correlated with race. Investigating further, we find that whereas both FC and SNP (genomic) features can account for 10-15% of WRAT score variation, this predictive ability disappears when controlling for race. In this work, we demonstrate the advantage of our ImageNomer GUI tool in data exploration and confound detection. Additionally, this work identifies race as a strong confound in FC data and casts doubt on the possibility of finding unbiased achievement-related features in fMRI and SNP data of healthy adolescents.
    DeepSpeed4Science Initiative: Enabling Large-Scale Scientific Discovery through Sophisticated AI System Technologies. (arXiv:2310.04610v2 [cs.AI] UPDATED)
    In the upcoming decade, deep learning may revolutionize the natural sciences, enhancing our capacity to model and predict natural occurrences. This could herald a new era of scientific exploration, bringing significant advancements across sectors from drug development to renewable energy. To answer this call, we present the DeepSpeed4Science initiative (deepspeed4science.ai), which aims to build unique capabilities through AI system technology innovations to help domain experts unlock today's biggest science mysteries. By leveraging DeepSpeed's current technology pillars (training, inference and compression) as base technology enablers, DeepSpeed4Science will create a new set of AI system technologies tailored for accelerating scientific discoveries by addressing their unique complexity beyond the common technical approaches used for accelerating generic large language models (LLMs). In this paper, we showcase the early progress we made with DeepSpeed4Science in addressing two of the critical system challenges in structural biology research.
    Conditional Sig-Wasserstein GANs for Time Series Generation. (arXiv:2006.05421v2 [cs.LG] UPDATED)
    Generative adversarial networks (GANs) have been extremely successful in generating samples from seemingly high-dimensional probability measures. However, these methods struggle to capture the temporal dependence of joint probability distributions induced by time-series data. Furthermore, long time-series data streams hugely increase the dimension of the target space, which may render generative modelling infeasible. To overcome these challenges, motivated by the autoregressive models in econometrics, we are interested in the conditional distribution of future time series given the past information. We propose the generic conditional Sig-WGAN framework by integrating Wasserstein-GANs (WGANs) with a mathematically principled and efficient path feature extraction called the signature of a path. The signature of a path is a graded sequence of statistics that provides a universal description for a stream of data, and its expected value characterises the law of the time-series model. In particular, we develop the conditional Sig-$W_1$ metric, which captures the conditional joint law of time series models, and use it as a discriminator. The signature feature space enables the explicit representation of the proposed discriminators, which alleviates the need for expensive training. We validate our method on both synthetic and empirical datasets and observe that our method consistently and significantly outperforms state-of-the-art benchmarks with respect to measures of similarity and predictive ability.  ( 3 min )
    Smoothed $f$-Divergence Distributionally Robust Optimization. (arXiv:2306.14041v2 [math.OC] UPDATED)
    In data-driven optimization, sample average approximation (SAA) is known to suffer from the so-called optimizer's curse that causes an over-optimistic evaluation of the solution performance. We argue that a special type of distributionally robust optimization (DRO) formulation offers theoretical advantages in correcting for this optimizer's curse compared to simple "margin" adjustments to SAA and other DRO approaches: It attains a statistical bound on the out-of-sample performance, for a wide class of objective functions and distributions, that is nearly tightest in terms of exponential decay rate. This DRO uses an ambiguity set based on a Kullback-Leibler (KL) divergence smoothed by the Wasserstein or Lévy-Prokhorov (LP) distance via a suitable distance optimization. Computationally, we also show that such a DRO, and its generalized versions using smoothed $f$-divergence, are not harder than DRO problems based on $f$-divergence or Wasserstein distances, rendering our DRO formulations both statistically optimal and computationally viable.  ( 2 min )
    Exploring the Relationship Between Model Architecture and In-Context Learning Ability. (arXiv:2310.08049v1 [cs.LG])
    What is the relationship between model architecture and the ability to perform in-context learning? In this empirical study, we take the first steps towards answering this question. In particular, we evaluate fifteen model architectures across a suite of synthetic in-context learning tasks. The selected architectures represent a broad range of paradigms, including recurrent and convolution-based neural networks, transformers, and emerging attention alternatives. We discover that all considered architectures can perform in-context learning under certain conditions. However, contemporary architectures are found to be the best performing, especially as task complexity grows. Additionally, our follow-up experiments delve into various factors that influence in-context learning. We observe varied sensitivities among architectures with respect to hyperparameter settings. Our study of training dynamics reveals that certain architectures exhibit a smooth, progressive learning trajectory, while others demonstrate periods of stagnation followed by abrupt mastery of the task. Finally, and somewhat surprisingly, we find that several emerging attention alternatives are more robust in-context learners than transformers; since such approaches have constant-sized memory footprints at inference time, this result opens the future possibility of scaling up in-context learning to vastly larger numbers of in-context examples.
    BarlowRL: Barlow Twins for Data-Efficient Reinforcement Learning. (arXiv:2308.04263v3 [cs.LG] UPDATED)
    This paper introduces BarlowRL, a data-efficient reinforcement learning agent that combines the Barlow Twins self-supervised learning framework with the DER (Data-Efficient Rainbow) algorithm. BarlowRL outperforms both DER and its contrastive counterpart CURL on the Atari 100k benchmark. BarlowRL avoids dimensional collapse by enforcing information spread across the whole space. This helps RL algorithms utilize a uniformly spread state representation, which eventually results in remarkable performance. The integration of Barlow Twins with DER enhances data efficiency and achieves superior performance in the RL tasks. BarlowRL demonstrates the potential of incorporating self-supervised learning techniques to improve RL algorithms.
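The claim about avoiding dimensional collapse can be made concrete with a sketch of the Barlow Twins objective in its usual form: the cross-correlation matrix between the standardized embeddings of two views is pushed towards the identity. This is a pure-Python illustration of the generic loss, not BarlowRL's implementation; how it is combined with DER is not described in the abstract, and the off-diagonal weight and toy embeddings below are hypothetical.

```python
import math

def barlow_twins_loss(z_a, z_b, off_diag_weight=5e-3):
    """Barlow Twins loss for two batches of embeddings (lists of equal-size rows)."""
    n = len(z_a)
    d = len(z_a[0])

    def standardize(z):
        # Return d columns, each with zero mean and unit variance over the batch.
        columns = []
        for col in zip(*z):
            mean = sum(col) / n
            std = math.sqrt(sum((v - mean) ** 2 for v in col) / n) or 1.0
            columns.append([(v - mean) / std for v in col])
        return columns

    a, b = standardize(z_a), standardize(z_b)
    loss = 0.0
    for i in range(d):
        for j in range(d):
            # Cross-correlation between feature i of view A and feature j of view B.
            c = sum(x * y for x, y in zip(a[i], b[j])) / n
            if i == j:
                loss += (1.0 - c) ** 2            # invariance term
            else:
                loss += off_diag_weight * c ** 2  # redundancy-reduction term
    return loss

# Two identical views whose features are decorrelated: loss is zero.
spread = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
loss_spread = barlow_twins_loss(spread, spread)

# Two identical views whose features are copies of each other (collapsed).
collapsed = [[1.0, 1.0], [-1.0, -1.0]]
loss_collapsed = barlow_twins_loss(collapsed, collapsed)
```

The collapsed batch is penalized only through the off-diagonal term, which is the part of the objective that discourages redundant dimensions.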
    Network Synthetic Interventions: A Causal Framework for Panel Data Under Network Interference. (arXiv:2210.11355v2 [econ.EM] UPDATED)
    We propose a generalization of the synthetic controls and synthetic interventions methodology to incorporate network interference. We consider the estimation of unit-specific potential outcomes from panel data in the presence of spillover across units and unobserved confounding. Key to our approach is a novel latent factor model that takes into account network interference and generalizes the factor models typically used in panel data settings. We propose an estimator, Network Synthetic Interventions (NSI), and show that it consistently estimates the mean outcomes for a unit under an arbitrary set of counterfactual treatments for the network. We further establish that the estimator is asymptotically normal. We furnish two validity tests for whether the NSI estimator reliably generalizes to produce accurate counterfactual estimates. We provide a novel graph-based experiment design that guarantees the NSI estimator produces accurate counterfactual estimates, and also analyze the sample complexity of the proposed design. We conclude with simulations that corroborate our theoretical findings.
    Towards Data-and Knowledge-Driven Artificial Intelligence: A Survey on Neuro-Symbolic Computing. (arXiv:2210.15889v4 [cs.AI] UPDATED)
    Neural-symbolic computing (NeSy), which pursues the integration of the symbolic and statistical paradigms of cognition, has been an active research area of Artificial Intelligence (AI) for many years. As NeSy shows promise of reconciling the advantages of reasoning and interpretability of symbolic representation and robust learning in neural networks, it may serve as a catalyst for the next generation of AI. In the present paper, we provide a systematic overview of the recent developments and important contributions of NeSy research. Firstly, we introduce the history of this area, covering early work and foundations. We further discuss background concepts and identify key driving factors behind the development of NeSy. Afterward, we categorize recent landmark approaches along several main characteristics that underlie this research paradigm, including neural-symbolic integration, knowledge representation, knowledge embedding, and functionality. Next, we briefly discuss the successful application of modern NeSy approaches in several domains. Then, we benchmark several NeSy methods on three representative application tasks. Finally, we identify the open problems together with potential future research directions. This survey is expected to help new researchers enter this rapidly evolving field and accelerate the progress towards data- and knowledge-driven AI.  ( 2 min )
    GePSAn: Generative Procedure Step Anticipation in Cooking Videos. (arXiv:2310.08312v1 [cs.CV])
    We study the problem of future step anticipation in procedural videos. Given a video of an ongoing procedural activity, we predict a plausible next procedure step described in rich natural language. While most previous work focuses on the problem of data scarcity in procedural video datasets, another core challenge of future anticipation is how to account for multiple plausible future realizations in natural settings. This problem has been largely overlooked in previous work. To address this challenge, we frame future step prediction as modelling the distribution of all possible candidates for the next step. Specifically, we design a generative model that takes a series of video clips as input, and generates multiple plausible and diverse candidates (in natural language) for the next step. Following previous work, we side-step the video annotation scarcity by pretraining our model on a large text-based corpus of procedural activities, and then transfer the model to the video domain. Our experiments, both in textual and video domains, show that our model captures diversity in the next step prediction and generates multiple plausible future predictions. Moreover, our model establishes new state-of-the-art results on YouCookII, where it outperforms existing baselines on the next step anticipation. Finally, we also show that our model can successfully transfer from text to the video domain zero-shot, i.e., without fine-tuning or adaptation, and produces good-quality future step predictions from video.
    GraphControl: Adding Conditional Control to Universal Graph Pre-trained Models for Graph Domain Transfer Learning. (arXiv:2310.07365v2 [cs.LG] UPDATED)
    Graph-structured data, which models complex relationships between objects, is ubiquitous and enables various Web applications. Daily influxes of unlabeled graph data on the Web offer immense potential for these applications. Graph self-supervised algorithms have achieved significant success in acquiring generic knowledge from abundant unlabeled graph data. These pre-trained models can be applied to various downstream Web applications, saving training time and improving downstream (target) performance. However, different graphs, even across seemingly similar domains, can differ significantly in terms of attribute semantics, making it difficult, if not infeasible, to transfer the pre-trained models to downstream tasks. Concretely, the additional task-specific node information in downstream tasks (specificity) is usually deliberately omitted so that the pre-trained representation (transferability) can be leveraged. We term this trade-off the "transferability-specificity dilemma" in this work. To address this challenge, we introduce an innovative deployment module called GraphControl, motivated by ControlNet, to realize better graph domain transfer learning. Specifically, by leveraging universal structural pre-trained models and GraphControl, we align the input space across various graphs and incorporate unique characteristics of target data as conditional inputs. These conditions will be progressively integrated into the model during fine-tuning or prompt tuning through ControlNet, facilitating personalized deployment. Extensive experiments show that our method significantly enhances the adaptability of pre-trained models on target attributed datasets, achieving 1.4-3x performance gain. Furthermore, it outperforms training-from-scratch methods on target data with a comparable margin and exhibits faster convergence.
    Distilling Large Vision-Language Model with Out-of-Distribution Generalizability. (arXiv:2307.03135v3 [cs.CV] UPDATED)
    Large vision-language models have achieved outstanding performance, but their size and computational requirements make their deployment on resource-constrained devices and time-sensitive tasks impractical. Model distillation, the process of creating smaller, faster models that maintain the performance of larger models, is a promising direction towards the solution. This paper investigates the distillation of visual representations in large teacher vision-language models into lightweight student models using a small- or mid-scale dataset. Notably, this study focuses on open-vocabulary out-of-distribution (OOD) generalization, a challenging problem that has been overlooked in previous model distillation literature. We propose two principles from vision and language modality perspectives to enhance the student's OOD generalization: (1) by better imitating the teacher's visual representation space, and carefully promoting better coherence in vision-language alignment with the teacher; (2) by enriching the teacher's language representations with informative and fine-grained semantic attributes to effectively distinguish between different labels. We propose several metrics and conduct extensive experiments to investigate these techniques. The results demonstrate significant improvements in zero-shot and few-shot student performance on open-vocabulary out-of-distribution classification, highlighting the effectiveness of our proposed approaches. Poster: https://xuanlinli17.github.io/pdfs/iccv23_large_vlm_distillation_poster.pdf Code: https://github.com/xuanlinli17/large_vlm_distillation_ood  ( 2 min )
    Imitation Learning from Observation with Automatic Discount Scheduling. (arXiv:2310.07433v2 [cs.RO] UPDATED)
    Humans often acquire new skills through observation and imitation. For robotic agents, learning from the plethora of unlabeled video demonstration data available on the Internet necessitates imitating the expert without access to its action, presenting a challenge known as Imitation Learning from Observations (ILfO). A common approach to tackle ILfO problems is to convert them into inverse reinforcement learning problems, utilizing a proxy reward computed from the agent's and the expert's observations. Nonetheless, we identify that tasks characterized by a progress dependency property pose significant challenges for such approaches; in these tasks, the agent needs to initially learn the expert's preceding behaviors before mastering the subsequent ones. Our investigation reveals that the main cause is that the reward signals assigned to later steps hinder the learning of initial behaviors. To address this challenge, we present a novel ILfO framework that enables the agent to master earlier behaviors before advancing to later ones. We introduce an Automatic Discount Scheduling (ADS) mechanism that adaptively alters the discount factor in reinforcement learning during the training phase, prioritizing earlier rewards initially and gradually engaging later rewards only when the earlier behaviors have been mastered. Our experiments, conducted on nine Meta-World tasks, demonstrate that our method significantly outperforms state-of-the-art methods across all tasks, including those that are unsolvable by them.
    PromptTTS 2: Describing and Generating Voices with Text Prompt. (arXiv:2309.02285v2 [eess.AS] UPDATED)
    Speech conveys more information than text, as the same word can be uttered in various voices to convey diverse information. Compared to traditional text-to-speech (TTS) methods relying on speech prompts (reference speech) for voice variability, using text prompts (descriptions) is more user-friendly since speech prompts can be hard to find or may not exist at all. TTS approaches based on the text prompt face two main challenges: 1) the one-to-many problem, where not all details about voice variability can be described in the text prompt, and 2) the limited availability of text prompt datasets, since writing text prompts for speech requires vendors and a large data-labeling cost. In this work, we introduce PromptTTS 2 to address these challenges with a variation network that provides variability information of voice not captured by text prompts, and a prompt generation pipeline that uses large language models (LLMs) to compose high-quality text prompts. Specifically, the variation network predicts the representation extracted from the reference speech (which contains full information about voice variability) based on the text prompt representation. The prompt generation pipeline generates text prompts for speech with a speech language understanding model that recognizes voice attributes (e.g., gender, speed) from speech and a large language model that formulates text prompts based on the recognition results. Experiments on a large-scale (44K hours) speech dataset demonstrate that, compared to previous works, PromptTTS 2 generates voices more consistent with text prompts and supports the sampling of diverse voice variability, thereby offering users more choices on voice generation. Additionally, the prompt generation pipeline produces high-quality text prompts, eliminating the large labeling cost. The demo page of PromptTTS 2 is available online.
    A Neural-preconditioned Poisson Solver for Mixed Dirichlet and Neumann Boundary Conditions. (arXiv:2310.00177v3 [math.NA] UPDATED)
    We introduce a neural-preconditioned iterative solver for Poisson equations with mixed boundary conditions. The Poisson equation is ubiquitous in scientific computing: it governs a wide array of physical phenomena, arises as a subproblem in many numerical algorithms, and serves as a model problem for the broader class of elliptic PDEs. The most popular Poisson discretizations yield large sparse linear systems. At high resolution, and for performance-critical applications, iterative solvers can be advantageous for these -- but only when paired with powerful preconditioners. The core of our solver is a neural network trained to approximate the inverse of a discrete structured-grid Laplace operator for a domain of arbitrary shape and with mixed boundary conditions. The structure of this problem motivates a novel network architecture that we demonstrate is highly effective as a preconditioner even for boundary conditions outside the training set. We show that on challenging test cases arising from an incompressible fluid simulation, our method outperforms state-of-the-art solvers like algebraic multigrid as well as some recent neural preconditioners.
    TEMPO: Prompt-based Generative Pre-trained Transformer for Time Series Forecasting. (arXiv:2310.04948v2 [cs.LG] UPDATED)
    The past decade has witnessed significant advances in time series modeling with deep learning. While achieving state-of-the-art results, the best-performing architectures vary highly across applications and domains. Meanwhile, for natural language processing, the Generative Pre-trained Transformer (GPT) has demonstrated impressive performance via training one general-purpose model across various textual datasets. It is intriguing to explore whether GPT-type architectures can be effective for time series, capturing the intrinsic dynamic attributes and leading to significant accuracy improvements. In this paper, we propose a novel framework, TEMPO, that can effectively learn time series representations. We focus on utilizing two essential inductive biases of the time series task for pre-trained models: (i) decomposition of the complex interaction between trend, seasonal and residual components; and (ii) introducing the selection-based prompts to facilitate distribution adaptation in non-stationary time series. TEMPO expands the capability for dynamically modeling real-world temporal phenomena from data within diverse domains. Our experiments demonstrate the superior performance of TEMPO over state-of-the-art methods on a number of time series benchmark datasets. This performance gain is observed not only in standard supervised learning settings but also in scenarios involving previously unseen datasets as well as in scenarios with multi-modal inputs. This compelling finding highlights TEMPO's potential to constitute a foundational model-building framework.
    Efficient probabilistic reconciliation of forecasts for real-valued and count time series. (arXiv:2210.02286v3 [stat.ML] UPDATED)
    Hierarchical time series are common in several applied fields. The forecasts for these time series are required to be coherent, that is, to satisfy the constraints given by the hierarchy. The most popular technique to enforce coherence is called reconciliation, which adjusts the base forecasts computed for each time series. However, recent works on probabilistic reconciliation present several limitations. In this paper, we propose a new approach based on conditioning to reconcile any type of forecast distribution. We then introduce a new algorithm, called Bottom-Up Importance Sampling, to efficiently sample from the reconciled distribution. It can be used for any base forecast distribution: discrete, continuous, or in the form of samples, providing a major speedup compared to the current methods. Experiments on several temporal hierarchies show a significant improvement over base probabilistic forecasts.  ( 2 min )
    Lion Secretly Solves Constrained Optimization: As Lyapunov Predicts. (arXiv:2310.05898v2 [cs.LG] UPDATED)
    Lion (Evolved Sign Momentum), a new optimizer discovered through program search, has shown promising results in training large AI models. It performs comparably or favorably to AdamW but with greater memory efficiency. As we can expect from the results of a random search program, Lion incorporates elements from several existing algorithms, including signed momentum, decoupled weight decay, Polak, and Nesterov momentum, but does not fit into any existing category of theoretically grounded optimizers. Thus, even though Lion appears to perform well as a general-purpose optimizer for a wide range of tasks, its theoretical basis remains uncertain. This lack of theoretical clarity limits opportunities to further enhance and expand Lion's efficacy. This work aims to demystify Lion. Based on both continuous-time and discrete-time analysis, we demonstrate that Lion is a theoretically novel and principled approach for minimizing a general loss function $f(x)$ while enforcing a bound constraint $\|x\|_\infty \leq 1/\lambda$. Lion achieves this through the incorporation of decoupled weight decay, where $\lambda$ represents the weight decay coefficient. Our analysis is made possible by the development of a new Lyapunov function for the Lion updates. It applies to a broader family of Lion-$\kappa$ algorithms, where the $\text{sign}(\cdot)$ operator in Lion is replaced by the subgradient of a convex function $\kappa$, leading to the solution of a general composite optimization problem of $\min_x f(x) + \kappa^*(x)$. Our findings provide valuable insights into the dynamics of Lion and pave the way for further improvements and extensions of Lion-related algorithms.
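    The bound constraint can be observed numerically. The sketch below uses the commonly published form of the Lion update (sign of an interpolated momentum plus decoupled weight decay); the quadratic objective, the hyperparameter values, and the function name `lion_step` are assumptions chosen for illustration. Starting inside the box, each step satisfies $|x_{t+1}| \leq (1 - \eta\lambda)|x_t| + \eta$, so the iterates never leave $\|x\|_\infty \leq 1/\lambda$.

```python
import numpy as np

def lion_step(x, m, grad, lr, beta1, beta2, weight_decay):
    """One Lion update with decoupled weight decay."""
    # Direction: sign of the interpolation between momentum and the current gradient.
    direction = np.sign(beta1 * m + (1.0 - beta1) * grad)
    x_new = x - lr * (direction + weight_decay * x)
    # Momentum is updated with a second, slower interpolation factor.
    m_new = beta2 * m + (1.0 - beta2) * grad
    return x_new, m_new

# f(x) = 0.5 * ||x - target||^2, whose unconstrained minimizer lies outside the box.
target = np.array([5.0, -5.0, 0.2])
weight_decay = 1.0  # implies the bound ||x||_inf <= 1 / weight_decay = 1
x = np.zeros(3)
m = np.zeros(3)
for _ in range(2000):
    grad = x - target
    x, m = lion_step(x, m, grad, lr=0.01, beta1=0.9, beta2=0.99,
                     weight_decay=weight_decay)
```

    The first two coordinates are pulled toward targets at plus and minus 5 but settle at the boundary of the box, while the third stays in the interior near its target.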
    Clustering Three-Way Data with Outliers. (arXiv:2310.05288v2 [stat.ML] UPDATED)
    Matrix-variate distributions are a recent addition to the model-based clustering field, thereby making it possible to analyze data in matrix form with complex structure such as images and time series. Due to its recent appearance, there is limited literature on matrix-variate data, with even less on dealing with outliers in these models. An approach for clustering matrix-variate normal data with outliers is discussed. The approach, which uses the distribution of subset log-likelihoods, extends the OCLUST algorithm to matrix-variate normal data and uses an iterative approach to detect and trim outliers.
    Federated Generalization via Information-Theoretic Distribution Diversification. (arXiv:2310.07171v2 [cs.LG] UPDATED)
    Federated Learning (FL) has surged in prominence due to its capability of collaborative model training without direct data sharing. However, the vast disparity in local data distributions among clients, often termed the non-Independent Identically Distributed (non-IID) challenge, poses a significant hurdle to FL's generalization efficacy. The scenario becomes even more complex when not all clients participate in the training process, a common occurrence due to unstable network connections or limited computational capacities. This can greatly complicate the assessment of the trained models' generalization abilities. While a plethora of recent studies has centered on the generalization gap pertaining to unseen data from participating clients with diverse distributions, the divergence between the training distributions of participating clients and the testing distributions of non-participating ones has been largely overlooked. In response, our paper unveils an information-theoretic generalization framework for FL. Specifically, it quantifies generalization errors by evaluating the information entropy of local distributions and discerning discrepancies across these distributions. Inspired by our deduced generalization bounds, we introduce a weighted aggregation approach and a duo of client selection strategies. These innovations aim to bolster FL's generalization prowess by encompassing a more varied set of client data distributions. Our extensive empirical evaluations reaffirm the potency of our proposed methods, aligning seamlessly with our theoretical construct.
    SpikeCLIP: A Contrastive Language-Image Pretrained Spiking Neural Network. (arXiv:2310.06488v2 [cs.NE] UPDATED)
    Spiking neural networks (SNNs) have demonstrated the capability to achieve comparable performance to deep neural networks (DNNs) in both visual and linguistic domains while offering the advantages of improved energy efficiency and adherence to biological plausibility. However, the extension of such single-modality SNNs into the realm of multimodal scenarios remains an unexplored territory. Drawing inspiration from the concept of contrastive language-image pre-training (CLIP), we introduce a novel framework, named SpikeCLIP, to address the gap between two modalities within the context of spike-based computing through a two-step recipe involving "Alignment Pre-training + Dual-Loss Fine-tuning". Extensive experiments demonstrate that SNNs achieve comparable results to their DNN counterparts while significantly reducing energy consumption across a variety of datasets commonly used for multimodal model evaluation. Furthermore, SpikeCLIP maintains robust performance in image classification tasks that involve class labels not predefined within specific categories.
    GP-net: Flexible Viewpoint Grasp Proposal. (arXiv:2209.10404v3 [cs.RO] UPDATED)
    We present the Grasp Proposal Network (GP-net), a Convolutional Neural Network model which can generate 6-DoF grasps from flexible viewpoints, e.g. as experienced by mobile manipulators. To train GP-net, we synthetically generate a dataset containing depth-images and ground-truth grasp information. In real-world experiments, we use the EGAD evaluation benchmark to evaluate GP-net against two commonly used algorithms, the Volumetric Grasping Network (VGN) and the Grasp Pose Detection package (GPD), on a PAL TIAGo mobile manipulator. In contrast to the state-of-the-art methods in robotic grasping, GP-net can be used for grasping objects from flexible, unknown viewpoints without the need to define the workspace and achieves a grasp success of 54.4% compared to 51.6% for VGN and 44.2% for GPD. We provide a ROS package along with our code and pre-trained models at https://aucoroboticsmu.github.io/GP-net/.
    FedSym: Unleashing the Power of Entropy for Benchmarking the Algorithms for Federated Learning. (arXiv:2310.07807v1 [cs.LG])
    Federated learning (FL) is a decentralized machine learning approach where independent learners process data privately. Its goal is to create a robust and accurate model by aggregating and retraining local models over multiple rounds. However, FL faces challenges regarding data heterogeneity and model aggregation effectiveness. In order to simulate real-world data, researchers use methods for data partitioning that transform a dataset designated for centralized learning into a group of sub-datasets suitable for distributed machine learning with different data heterogeneity. In this paper, we study the currently popular data partitioning techniques and visualize their main disadvantages: the lack of precision in the data diversity, which leads to unreliable heterogeneity indexes, and the inability to incrementally challenge the FL algorithms. To resolve this problem, we propose a method that leverages entropy and symmetry to construct 'the most challenging' and controllable data distributions with gradual difficulty. We introduce a metric to measure data heterogeneity among the learning agents and a transformation technique that divides any dataset into splits with precise data diversity. Through a comparative study, we demonstrate the superiority of our method over existing FL data partitioning approaches, showcasing its potential to challenge model aggregation algorithms. Experimental results indicate that our approach gradually challenges the FL strategies, and the models trained on FedSym distributions are more distinct.
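    As a rough illustration of using entropy to quantify how diverse a client's data is, the sketch below computes the Shannon entropy of a client's label histogram. The abstract does not define FedSym's heterogeneity metric, so this is only an assumed, generic measure, and `label_entropy` is a hypothetical helper.

```python
import math
from collections import Counter

def label_entropy(labels):
    """Shannon entropy (in bits) of the empirical label distribution of one client."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

balanced_client = [0, 1, 2, 3]        # every class equally represented
single_class_client = [7, 7, 7]       # extreme label skew
skewed_client = [0, 0, 0, 1]          # mostly one class
```

    A client holding a single class has zero entropy, a perfectly balanced client over four classes has two bits, and skewed clients fall in between.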
    GPT-4 as an Agronomist Assistant? Answering Agriculture Exams Using Large Language Models. (arXiv:2310.06225v2 [cs.AI] UPDATED)
    Large language models (LLMs) have demonstrated remarkable capabilities in natural language understanding across various domains, including healthcare and finance. For some tasks, LLMs achieve similar or better performance than trained human beings, so it is reasonable to employ human exams (e.g., certification tests) to assess the performance of LLMs. We present a comprehensive evaluation of popular LLMs, such as Llama 2 and GPT, on their ability to answer agriculture-related questions. In our evaluation, we also employ RAG (Retrieval-Augmented Generation) and ER (Ensemble Refinement) techniques, which combine information retrieval, generation capabilities, and prompting strategies to improve the LLMs' performance. To demonstrate the capabilities of LLMs, we selected agriculture exams and benchmark datasets from three of the largest agriculture producer countries: Brazil, India, and the USA. Our analysis highlights GPT-4's ability to achieve a passing score on exams to earn credits for renewing agronomist certifications, answering 93% of the questions correctly and outperforming earlier general-purpose models, which achieved 88% accuracy. In one of our experiments, GPT-4 obtained the highest performance when compared to human subjects. This performance suggests that GPT-4 could potentially pass major graduate education admission tests or even earn credits for renewing agronomy certificates. We also explore the models' capacity to address general agriculture-related questions and generate crop management guidelines for Brazilian and Indian farmers, utilizing robust datasets from the Brazilian Agency of Agriculture (Embrapa) and graduate program exams from India. The results suggest that GPT-4, ER, and RAG can contribute meaningfully to agricultural education, assessment, and crop management practice, offering valuable insights to farmers and agricultural professionals.
    NECO: NEural Collapse Based Out-of-distribution detection. (arXiv:2310.06823v2 [stat.ML] UPDATED)
    Detecting out-of-distribution (OOD) data is a critical challenge in machine learning because models are often overconfident and unaware of their epistemological limits. We hypothesize that "neural collapse", a phenomenon affecting in-distribution data for models trained beyond loss convergence, also influences OOD data. To benefit from this interplay, we introduce NECO, a novel post-hoc method for OOD detection, which leverages the geometric properties of "neural collapse" and of principal component spaces to identify OOD data. Our extensive experiments demonstrate that NECO achieves state-of-the-art results on both small and large-scale OOD detection tasks while exhibiting strong generalization capabilities across different network architectures. Furthermore, we provide a theoretical explanation for the effectiveness of our method in OOD detection. We plan to release the code after the anonymity period.
    Locality-Aware Generalizable Implicit Neural Representation. (arXiv:2310.05624v2 [cs.LG] UPDATED)
    Generalizable implicit neural representation (INR) enables a single continuous function, i.e., a coordinate-based neural network, to represent multiple data instances by modulating its weights or intermediate features using latent codes. However, the expressive power of the state-of-the-art modulation is limited due to its inability to localize and capture fine-grained details of data entities such as specific pixels and rays. To address this issue, we propose a novel framework for generalizable INR that combines a transformer encoder with a locality-aware INR decoder. The transformer encoder predicts a set of latent tokens from a data instance to encode local information into each latent token. The locality-aware INR decoder extracts a modulation vector by selectively aggregating the latent tokens via cross-attention for a coordinate input and then predicts the output by progressively decoding with coarse-to-fine modulation through multiple frequency bandwidths. The selective token aggregation and the multi-band feature modulation enable us to learn locality-aware representation in spatial and spectral aspects, respectively. Our framework significantly outperforms previous generalizable INRs and validates the usefulness of the locality-aware latents for downstream tasks such as image generation.
    Defending Our Privacy With Backdoors. (arXiv:2310.08320v1 [cs.LG])
    The proliferation of large AI models trained on uncurated, often sensitive web-scraped data has raised significant privacy concerns. One of the concerns is that adversaries can extract information about the training data using privacy attacks. Unfortunately, the task of removing specific information from the models without sacrificing performance is not straightforward and has proven to be challenging. We propose a rather easy yet effective defense based on backdoor attacks to remove private information such as names of individuals from models, and focus in this work on text encoders. Specifically, through strategic insertion of backdoors, we align the embeddings of sensitive phrases with those of neutral terms, for example "a person" instead of the person's name. Our empirical results demonstrate the effectiveness of our backdoor-based defense on CLIP by assessing its performance using a specialized privacy attack for zero-shot classifiers. Our approach provides not only a new "dual-use" perspective on backdoor attacks, but also presents a promising avenue to enhance the privacy of individuals within models trained on uncurated web-scraped data.
    On Regularized Sparse Logistic Regression. (arXiv:2309.05925v2 [cs.LG] UPDATED)
    Sparse logistic regression performs classification and feature selection simultaneously. Although many studies have addressed $\ell_1$-regularized logistic regression, there is no equivalently abundant work on sparse logistic regression with a nonconvex regularization term. In this paper, we propose a unified framework to solve $\ell_1$-regularized logistic regression, which can be naturally extended to nonconvex regularization terms as long as a certain requirement is satisfied. In addition, we use a different line search criterion to guarantee monotone convergence for various regularization terms. Empirical experiments on binary classification tasks with real-world datasets demonstrate that our proposed algorithms perform classification and feature selection effectively at a lower computational cost.
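    For context, a standard baseline for $\ell_1$-regularized logistic regression is proximal gradient descent with soft-thresholding. The sketch below shows that textbook method, not the framework or line search proposed in the paper; the fixed step size, the synthetic data, and the function names are assumptions.

```python
import numpy as np

def soft_threshold(v, threshold):
    """Proximal operator of threshold * ||v||_1: shrink each entry toward zero."""
    return np.sign(v) * np.maximum(np.abs(v) - threshold, 0.0)

def l1_logistic_regression(X, y, lam, step, n_iters):
    """Proximal gradient (ISTA) for mean logistic loss + lam * ||w||_1, labels in {0, 1}."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        probs = 1.0 / (1.0 + np.exp(-X @ w))
        grad = X.T @ (probs - y) / len(y)                    # gradient of the smooth part
        w = soft_threshold(w - step * grad, step * lam)      # proximal step for the l1 part
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] > 0).astype(float)   # only the first feature carries signal
w = l1_logistic_regression(X, y, lam=0.1, step=0.5, n_iters=500)
```

    The soft-thresholding step is what sets weak coefficients exactly to zero, which is how the method performs feature selection while fitting the classifier.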
    Rethinking Negative Pairs in Code Search. (arXiv:2310.08069v1 [cs.SE])
    Recently, contrastive learning has become a key component in fine-tuning code search models for software development efficiency and effectiveness. It pulls together positive code snippets while pushing negative samples away given search queries. Among contrastive losses, InfoNCE is the most widely used due to its better performance. However, the following problems in negative samples of InfoNCE may deteriorate its representation learning: 1) the existence of false negative samples in large code corpora due to duplications; 2) the failure to explicitly differentiate between the potential relevance of negative samples. For example, a bubble sort implementation is less "negative" than a file-saving function for a quick sort query. In this paper, we tackle the above problems by proposing a simple yet effective Soft-InfoNCE loss that inserts weight terms into InfoNCE. In our proposed loss function, we apply three methods to estimate the weights of negative pairs and show that the vanilla InfoNCE loss is a special case of Soft-InfoNCE. Theoretically, we analyze the effects of Soft-InfoNCE on controlling the distribution of learnt code representations and on deducing a more precise mutual information estimation. We furthermore discuss the superiority of the proposed loss function over other design alternatives. Extensive experiments demonstrate the effectiveness of Soft-InfoNCE and weights estimation methods under state-of-the-art code search models on a large-scale public dataset consisting of six programming languages. Source code is available at https://github.com/Alex-HaochenLi/Soft-InfoNCE.
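    One way to picture "inserting weight terms into InfoNCE" is to scale each negative's contribution to the denominator. The sketch below is an assumed form for illustration only: the abstract does not give the exact placement or normalization of the weights, and `soft_info_nce` is a hypothetical name. With all weights equal to 1 it reduces to the vanilla InfoNCE loss.

```python
import numpy as np

def soft_info_nce(pos_score, neg_scores, neg_weights=None, temperature=1.0):
    """InfoNCE for one query, with an optional weight on each negative pair."""
    neg_scores = np.asarray(neg_scores, dtype=float)
    if neg_weights is None:
        neg_weights = np.ones_like(neg_scores)   # vanilla InfoNCE as the special case
    pos = np.exp(pos_score / temperature)
    neg = np.sum(np.asarray(neg_weights, dtype=float) * np.exp(neg_scores / temperature))
    return float(-np.log(pos / (pos + neg)))

neg_scores = [0.8, 0.1]  # the first negative is suspiciously similar to the query
vanilla = soft_info_nce(1.0, neg_scores)
softened = soft_info_nce(1.0, neg_scores, neg_weights=[0.2, 1.0])
```

    Down-weighting a likely false negative lowers its share of the loss, so the model is pushed away from it less strongly than from clearly irrelevant snippets.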
    Learning Collaborative Information Dissemination with Graph-based Multi-Agent Reinforcement Learning. (arXiv:2308.16198v2 [cs.LG] UPDATED)
    In modern communication systems, efficient and reliable information dissemination is crucial for supporting critical operations across domains like disaster response, autonomous vehicles, and sensor networks. This paper introduces a Multi-Agent Reinforcement Learning (MARL) approach as a significant step forward in achieving more decentralized, efficient, and collaborative solutions. We propose a Partially Observable Stochastic Game (POSG) formulation for information dissemination empowering each agent to decide on message forwarding independently, based on their one-hop neighborhood. This constitutes a significant paradigm shift from traditional heuristics based on Multi-Point Relay (MPR) selection. Our approach harnesses Graph Convolutional Reinforcement Learning, employing Graph Attention Networks (GAT) with dynamic attention to capture essential network features. We propose two approaches, L-DGN and HL-DGN, which differ in the information that is exchanged among agents. We evaluate the performance of our decentralized approaches, by comparing them with a widely-used MPR heuristic, and we show that our trained policies are able to efficiently cover the network while bypassing the MPR set selection process. Our approach is a first step toward supporting the resilience of real-world broadcast communication infrastructures via learned, collaborative information dissemination.
    Nest-DGIL: Nesterov-optimized Deep Geometric Incremental Learning for CS Image Reconstruction. (arXiv:2308.03807v2 [eess.IV] UPDATED)
    Proximal gradient-based optimization is one of the most common strategies for solving inverse problems in imaging, and it is easy to implement. However, these techniques often generate heavy artifacts in image reconstruction. One of the most popular refinement methods is to fine-tune the regularization parameter to alleviate such artifacts, but it may not always be sufficient or applicable due to increased computational costs. In this work, we propose a deep geometric incremental learning framework based on the second Nesterov proximal gradient optimization. The proposed end-to-end network not only has a powerful learning ability for high-/low-frequency image features, but can also theoretically guarantee that geometric texture details will be reconstructed from preliminary linear reconstruction. Furthermore, it can avoid the risk of intermediate reconstruction results falling outside the geometric decomposition domains and achieve fast convergence. Our reconstruction framework is decomposed into four modules including general linear reconstruction, cascade geometric incremental restoration, Nesterov acceleration, and post-processing. In the image restoration step, a cascade geometric incremental learning module is designed to compensate for missing texture information from different geometric spectral decomposition domains. Inspired by the overlap-tile strategy, we also develop a post-processing module to remove the block effect in patch-wise-based natural image reconstruction. All parameters in the proposed model are learnable, and an adaptive initialization technique for physical parameters is also employed to make the model flexible and ensure smooth convergence. We compare the reconstruction performance of the proposed method with existing state-of-the-art methods to demonstrate its superiority. Our source codes are available at https://github.com/fanxiaohong/Nest-DGIL.
    COVID-19 Detection Using Swin Transformer Approach from Computed Tomography Images. (arXiv:2310.08165v1 [eess.IV])
    The accurate and efficient diagnosis of COVID-19 is of paramount importance, particularly in the context of large-scale medical imaging datasets. In this preprint paper, we propose a novel approach for COVID-19 diagnosis using CT images that leverages the power of Swin Transformer models, state-of-the-art solutions in computer vision tasks. Our method includes a systematic approach for patient-level predictions, where individual CT slices are classified as COVID-19 or non-COVID, and the patient's overall diagnosis is determined through majority voting. The application of the Swin Transformer in this context results in patient-level predictions that demonstrate exceptional diagnostic accuracy. In terms of evaluation metrics, our approach consistently outperforms the baseline, as well as numerous competing methods, showcasing its effectiveness in COVID-19 diagnosis. The macro F1 score achieved by our model exceeds the baseline and offers a robust solution for accurate diagnosis.
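    The patient-level aggregation step can be sketched as a plain majority vote over per-slice labels. The label strings and the tie-breaking behaviour (the first label encountered wins a tie) are assumptions; the abstract does not say how ties are resolved.

```python
from collections import Counter

def patient_level_prediction(slice_labels):
    """Return the label predicted for the most CT slices of one patient."""
    if not slice_labels:
        raise ValueError("at least one slice prediction is required")
    counts = Counter(slice_labels)
    # most_common(1) returns the label with the highest count.
    return counts.most_common(1)[0][0]

slices = ["covid", "non-covid", "covid", "covid", "non-covid"]
diagnosis = patient_level_prediction(slices)
```

    Aggregating this way makes the patient-level result robust to a minority of misclassified slices.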
    Learn From Model Beyond Fine-Tuning: A Survey. (arXiv:2310.08184v1 [cs.AI])
    Foundation models (FM) have demonstrated remarkable performance across a wide range of tasks (especially in the fields of natural language processing and computer vision), primarily attributed to their ability to comprehend instructions and access extensive, high-quality data. This not only showcases their current effectiveness but also sets a promising trajectory towards the development of artificial general intelligence. Unfortunately, due to multiple constraints, the raw data of the model used for large model training are often inaccessible, so the use of end-to-end models for downstream tasks has become a new research trend, which we call Learn From Model (LFM) in this article. LFM focuses on the research, modification, and design of FM based on the model interface, so as to better understand the model structure and weights (in a black box environment), and to generalize the model to downstream tasks. The study of LFM techniques can be broadly categorized into five major areas: model tuning, model distillation, model reuse, meta learning and model editing. Each category encompasses a repertoire of methods and strategies that aim to enhance the capabilities and performance of FM. This paper gives a comprehensive review of the current methods based on FM from the perspective of LFM, in order to help readers better understand the current research status and ideas. To conclude, we summarize the survey by highlighting several critical areas for future exploration and addressing open issues that require further attention from the research community. The relevant papers we investigated in this article can be accessed at .
    Emergence of Latent Binary Encoding in Deep Neural Network Classifiers. (arXiv:2310.08224v1 [cs.LG])
    We observe the emergence of binary encoding within the latent space of deep-neural-network classifiers. Such binary encoding is induced by introducing a linear penultimate layer, which is equipped during training with a loss function that grows as $\exp(\vec{x}^2)$, where $\vec{x}$ are the coordinates in the latent space. The phenomenon we describe represents a specific instance of a well-documented occurrence known as neural collapse, which arises in the terminal phase of training and entails the collapse of latent class means to the vertices of a simplex equiangular tight frame (ETF). We show that binary encoding accelerates convergence toward the simplex ETF and enhances classification accuracy.
    On Training Derivative-Constrained Neural Networks. (arXiv:2310.01649v2 [cs.LG] UPDATED)
    We refer to the setting where the (partial) derivatives of a neural network's (NN's) predictions with respect to its inputs are used as an additional training signal as a derivative-constrained (DC) NN. This situation is common in physics-informed settings in the natural sciences. We propose an integrated ReLU (IReLU) activation function to improve training of DC NNs. We also investigate denormalization and label rescaling to help stabilize DC training. We evaluate our methods on physics-informed settings including quantum chemistry and Scientific Machine Learning (SciML) tasks. We demonstrate that existing architectures with IReLU activations combined with denormalization and label rescaling better incorporate the training signal provided by derivative constraints.
    Generalization bounds for neural ordinary differential equations and deep residual networks. (arXiv:2305.06648v2 [stat.ML] UPDATED)
    Neural ordinary differential equations (neural ODEs) are a popular family of continuous-depth deep learning models. In this work, we consider a large family of parameterized ODEs with continuous-in-time parameters, which include time-dependent neural ODEs. We derive a generalization bound for this class by a Lipschitz-based argument. By leveraging the analogy between neural ODEs and deep residual networks, our approach yields in particular a generalization bound for a class of deep residual networks. The bound involves the magnitude of the difference between successive weight matrices. We illustrate numerically how this quantity affects the generalization capability of neural networks.
    Asynchronous Evolution of Deep Neural Network Architectures. (arXiv:2308.04102v2 [cs.NE] UPDATED)
    Many evolutionary algorithms (EAs) take advantage of parallel evaluation of candidates. However, if evaluation times vary significantly, many worker nodes (i.e., compute clients) are idle much of the time, waiting for the next generation to be created. Evolutionary neural architecture search (ENAS), a class of EAs that optimizes the architecture and hyperparameters of deep neural networks, is particularly vulnerable to this issue. This paper proposes a generic asynchronous evaluation strategy (AES) that is then adapted to work with ENAS. AES increases throughput by maintaining a queue of up to $K$ individuals ready to be sent to the workers for evaluation and proceeding to the next generation as soon as $M \ll K$ individuals have been evaluated. A suitable value for $M$ is determined experimentally, balancing diversity and efficiency. To showcase the generality and power of AES, it was first evaluated in eight-line sorting network design (a single-population optimization task with limited evaluation-time variability), achieving an over two-fold speedup. Next, it was evaluated in 11-bit multiplexer design (a single-population discovery task with extended variability), where a 14-fold speedup was observed. It was then scaled up to ENAS for image captioning (a multi-population open-ended-optimization task), resulting in an over two-fold speedup. In all problems, a multifold performance improvement was observed, suggesting that AES is a promising method for parallelizing the evolution of complex systems with long and variable evaluation times, such as those in ENAS.
    Measuring Feature Sparsity in Language Models. (arXiv:2310.07837v1 [cs.LG])
    Recent works have proposed that activations in language models can be modelled as sparse linear combinations of vectors corresponding to features of input text. Under this assumption, these works aimed to reconstruct feature directions using sparse coding. We develop metrics to assess the success of these sparse coding techniques and test the validity of the linearity and sparsity assumptions. We show our metrics can predict the level of sparsity on synthetic sparse linear activations, and can distinguish between sparse linear data and several other distributions. We use our metrics to measure levels of sparsity in several language models. We find evidence that language model activations can be accurately modelled by sparse linear combinations of features, significantly more so than control datasets. We also show that model activations appear to be sparsest in the first and final layers.
    DePT: Decomposed Prompt Tuning for Parameter-Efficient Fine-tuning. (arXiv:2309.05173v2 [cs.CL] UPDATED)
    Prompt tuning (PT), where a small number of trainable soft (continuous) prompt vectors are affixed to the input of language models (LM), has shown promising results across various tasks and models for parameter-efficient fine-tuning (PEFT). PT stands out from other PEFT approaches because it maintains competitive performance with fewer trainable parameters and does not drastically scale up its parameters as the model size expands. However, PT introduces additional soft prompt tokens, leading to longer input sequences, which significantly impacts training and inference time and memory usage due to the Transformer's quadratic complexity. This is particularly concerning for Large Language Models (LLMs) that face heavy daily querying. To address this issue, we propose Decomposed Prompt Tuning (DePT), which decomposes the soft prompt into a shorter soft prompt and a pair of low-rank matrices that are then optimised with two different learning rates. This allows DePT to achieve better performance while saving over 20% memory and time costs compared to vanilla PT and its variants, without changing trainable parameter sizes. Through extensive experiments on 23 natural language processing (NLP) and vision-language (VL) tasks, we demonstrate that DePT outperforms state-of-the-art PEFT approaches, including the full fine-tuning baseline in some scenarios. Additionally, we empirically show that DePT grows more efficient as the model size increases. Our further study reveals that DePT integrates seamlessly with parameter-efficient transfer learning in the few-shot learning setting and highlights its adaptability to various model architectures and sizes.
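    The claim that the decomposition shortens the input without changing the number of trainable parameters can be checked with simple arithmetic. The sketch below assumes the low-rank pair is applied as an additive update to the frozen input embeddings, which the abstract does not state, and all sizes (hidden size 768, sequence length 256, prompt lengths 100 and 60, rank 30) are illustrative values chosen so that the parameter counts match.

```python
import numpy as np

hidden_dim = 768          # assumed model width
seq_len = 256             # assumed maximum input length
vanilla_prompt_len = 100  # soft prompt length in vanilla prompt tuning
short_prompt_len = 60     # shorter soft prompt after decomposition
rank = 30                 # rank of the two low-rank matrices

# Trainable parameters in each setup.
vanilla_params = vanilla_prompt_len * hidden_dim
dept_params = short_prompt_len * hidden_dim + rank * (seq_len + hidden_dim)

rng = np.random.default_rng(0)
frozen_embeddings = rng.normal(size=(seq_len, hidden_dim))     # stand-in for word embeddings
short_prompt = rng.normal(size=(short_prompt_len, hidden_dim))
low_rank_a = rng.normal(size=(seq_len, rank))
low_rank_b = np.zeros((rank, hidden_dim))                      # zero init: no change at start

# Model input: short prompt prepended to the low-rank-updated embeddings.
model_input = np.concatenate(
    [short_prompt, frozen_embeddings + low_rank_a @ low_rank_b], axis=0
)
```

    With these sizes both setups train 76,800 parameters, but the decomposed input has 316 positions instead of 356, which is where the quadratic attention cost is saved.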
    Memorization with neural nets: going beyond the worst case. (arXiv:2310.00327v2 [stat.ML] UPDATED)
    In practice, deep neural networks are often able to easily interpolate their training data. To understand this phenomenon, many works have aimed to quantify the memorization capacity of a neural network architecture: the largest number of points such that the architecture can interpolate any placement of these points with any assignment of labels. For real-world data, however, one intuitively expects the presence of a benign structure so that interpolation already occurs at a smaller network size than suggested by memorization capacity. In this paper, we investigate interpolation by adopting an instance-specific viewpoint. We introduce a simple randomized algorithm that, given a fixed finite dataset with two classes, with high probability constructs an interpolating three-layer neural network in polynomial time. The required number of parameters is linked to geometric properties of the two classes and their mutual arrangement. As a result, we obtain guarantees that are independent of the number of samples and hence move beyond worst-case memorization capacity bounds. We illustrate the effectiveness of the algorithm in non-pathological situations with extensive numerical experiments and link the insights back to the theoretical results.
    A Theoretical Explanation of Activation Sparsity through Flat Minima and Adversarial Robustness. (arXiv:2309.03004v2 [cs.LG] UPDATED)
    A recent empirical observation (Li et al., 2022b) of activation sparsity in MLP blocks offers an opportunity to drastically reduce computation costs for free. Although they attribute it to training dynamics, existing theoretical explanations of activation sparsity are restricted to shallow networks, small training steps and special training, despite its emergence in deep models standardly trained for a large number of steps. To fill these gaps, we propose the notion of gradient sparsity as one source of activation sparsity and a theoretical explanation based on it that sees sparsity as a necessary step to adversarial robustness w.r.t. hidden features and parameters, which is approximately the flatness of minima for well-learned models. The theory applies to standardly trained LayerNorm-ed MLPs, and further to Transformers or other architectures trained with weight noise. Eliminating other sources of flatness except for sparsity, we discover the phenomenon that the ratio between the largest and smallest non-zero singular values of weight matrices is small. When discussing the emergence of this spectral concentration, we use random matrix theory (RMT) as a powerful tool to analyze stochastic gradient noise. Validation experiments are conducted to verify our gradient-sparsity-based explanation. We propose two plug-and-play modules for both training and finetuning for sparsity. Experiments on ImageNet-1k and C4 demonstrate their 50% sparsity improvements, indicating further potential cost reduction in both training and inference.
    Semantic-Forward Relaying: A Novel Framework Towards 6G Cooperative Communications. (arXiv:2310.07987v1 [cs.NI])
    This letter proposes a novel relaying framework, semantic-forward (SF), for cooperative communications towards the sixth-generation (6G) wireless networks. The SF relay extracts and transmits the semantic features, which reduces forwarding payload, and also improves the network robustness against intra-link errors. Based on the theoretical basis for cooperative communications with side information and the turbo principle, we design a joint source-channel coding algorithm to iteratively exchange the extrinsic information for enhancing the decoding gains at the destination. Surprisingly, simulation results indicate that even in bad channel conditions, SF relaying can still effectively improve the recovered information quality.
    On the Security Vulnerabilities of Text-to-SQL Models. (arXiv:2211.15363v3 [cs.CL] UPDATED)
    Although it has been demonstrated that Natural Language Processing (NLP) algorithms are vulnerable to deliberate attacks, the question of whether such weaknesses can lead to software security threats is under-explored. To bridge this gap, we conducted vulnerability tests on Text-to-SQL systems that are commonly used to create natural language interfaces to databases. We showed that the Text-to-SQL modules within six commercial applications can be manipulated to produce malicious code, potentially leading to data breaches and Denial of Service attacks. This is the first demonstration that NLP models can be exploited as attack vectors in the wild. In addition, experiments using four open-source language models verified that straightforward backdoor attacks on Text-to-SQL systems achieve a 100% success rate without affecting their performance. The aim of this work is to draw the community's attention to potential software security issues associated with NLP algorithms and encourage exploration of methods to mitigate against them.
    Learning to Generate Novel Scientific Directions with Contextualized Literature-based Discovery. (arXiv:2305.14259v3 [cs.CL] UPDATED)
    Literature-Based Discovery (LBD) aims to discover new scientific knowledge by mining papers and generating hypotheses. Standard LBD is limited to predicting pairwise relations between discrete concepts (e.g., drug-disease links), and ignores critical contexts like experimental settings (e.g., a specific patient population where a drug is evaluated) and background motivations (e.g., to find drugs without specific side effects). We address these limitations with a novel formulation of contextualized-LBD (C-LBD): generating scientific hypotheses in natural language, while grounding them in a context that controls the hypothesis search space. We present a modeling framework using retrieval of "inspirations" from past scientific papers. Our evaluations reveal that GPT-4 tends to generate ideas with overall low technical depth and novelty, while our inspiration prompting approaches partially mitigate this issue. Our work represents a first step toward building language models that generate new ideas derived from scientific literature.
    Reset It and Forget It: Relearning Last-Layer Weights Improves Continual and Transfer Learning. (arXiv:2310.07996v1 [cs.LG])
    This work identifies a simple pre-training mechanism that leads to representations exhibiting better continual and transfer learning. This mechanism -- the repeated resetting of weights in the last layer, which we nickname "zapping" -- was originally designed for a meta-continual-learning procedure, yet we show it is surprisingly applicable in many settings beyond both meta-learning and continual learning. In our experiments, we wish to transfer a pre-trained image classifier to a new set of classes, in a few shots. We show that our zapping procedure results in improved transfer accuracy and/or more rapid adaptation in both standard fine-tuning and continual learning settings, while being simple to implement and computationally efficient. In many cases, we achieve performance on par with state of the art meta-learning without needing the expensive higher-order gradients, by using a combination of zapping and sequential learning. An intuitive explanation for the effectiveness of this zapping procedure is that representations trained with repeated zapping learn features that are capable of rapidly adapting to newly initialized classifiers. Such an approach may be considered a computationally cheaper type of, or alternative to, meta-learning rapidly adaptable features with higher-order gradients. This adds to recent work on the usefulness of resetting neural network parameters during training, and invites further investigation of this mechanism.
    Finite Scalar Quantization: VQ-VAE Made Simple. (arXiv:2309.15505v2 [cs.CV] UPDATED)
    We propose to replace vector quantization (VQ) in the latent representation of VQ-VAEs with a simple scheme termed finite scalar quantization (FSQ), where we project the VAE representation down to a few dimensions (typically less than 10). Each dimension is quantized to a small set of fixed values, leading to an (implicit) codebook given by the product of these sets. By appropriately choosing the number of dimensions and values each dimension can take, we obtain the same codebook size as in VQ. On top of such discrete representations, we can train the same models that have been trained on VQ-VAE representations, for example autoregressive and masked transformer models for image generation, multimodal generation, and dense prediction computer vision tasks. Concretely, we employ FSQ with MaskGIT for image generation, and with UViM for depth estimation, colorization, and panoptic segmentation. Despite the much simpler design of FSQ, we obtain competitive performance in all these tasks. We emphasize that FSQ does not suffer from codebook collapse and does not need the complex machinery employed in VQ (commitment losses, codebook reseeding, code splitting, entropy penalties, etc.) to learn expressive discrete representations.
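The quantization step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `tanh` bounding, the restriction to odd level counts, and the names `fsq_quantize` and `fsq_index` are assumptions made for illustration, and training-time gradient handling is left out entirely.

```python
import math

def fsq_quantize(z, levels):
    """Quantize each latent dimension to one of `levels[i]` integer values.

    Assumption: every entry of `levels` is odd, so the grid is symmetric
    around zero (for example 5 levels -> {-2, -1, 0, 1, 2}).
    """
    codes = []
    for value, n_levels in zip(z, levels):
        half = (n_levels - 1) / 2          # largest grid value for this dimension
        bounded = math.tanh(value) * half  # squash into (-half, half)
        codes.append(round(bounded))       # snap to the nearest integer grid point
    return codes

def fsq_index(codes, levels):
    """Map a code vector to a single index in the implicit product codebook."""
    index, stride = 0, 1
    for code, n_levels in zip(codes, levels):
        digit = code + (n_levels - 1) // 2  # shift to 0 .. n_levels - 1
        index += digit * stride             # mixed-radix encoding
        stride *= n_levels
    return index

levels = [5, 5, 5]                 # implicit codebook size is 5 * 5 * 5 = 125
codes = fsq_quantize([10.0, -10.0, 0.0], levels)
token = fsq_index(codes, levels)
```

The implicit codebook size is the product of the per-dimension level counts, which is how a handful of dimensions can match a typical VQ codebook size without storing any codebook vectors.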
    Neural Diffusion Models. (arXiv:2310.08337v1 [cs.LG])
    Diffusion models have shown remarkable performance on many generative tasks. Despite recent success, most diffusion models are restricted in that they only allow linear transformation of the data distribution. In contrast, a broader family of transformations can potentially help train generative distributions more efficiently, simplifying the reverse process and closing the gap between the true negative log-likelihood and the variational approximation. In this paper, we present Neural Diffusion Models (NDMs), a generalization of conventional diffusion models that enables defining and learning time-dependent non-linear transformations of data. We show how to optimise NDMs using a variational bound in a simulation-free setting. Moreover, we derive a time-continuous formulation of NDMs, which allows fast and reliable inference using off-the-shelf numerical ODE and SDE solvers. Finally, we demonstrate the utility of NDMs with learnable transformations through experiments on standard image generation benchmarks, including CIFAR-10, downsampled versions of ImageNet and CelebA-HQ. NDMs outperform conventional diffusion models in terms of likelihood and produce high-quality samples.
    Pure Monte Carlo Counterfactual Regret Minimization. (arXiv:2309.03084v2 [cs.AI] UPDATED)
    Counterfactual Regret Minimization (CFR) and its variants are the best algorithms so far for solving large-scale incomplete information games. However, we believe that there are two problems with CFR. First, matrix multiplication is required in each CFR iteration, so the time complexity of one iteration is too high. Second, real-world games have different characteristics, so a single CFR algorithm will not suit all game problems. To address these two problems, this paper proposes a new algorithm called Pure CFR (PCFR) based on CFR. PCFR can be seen as a combination of CFR and Fictitious Play (FP), inheriting the concept of counterfactual regret (value) from CFR, and using the best response strategy instead of the regret matching strategy for the next iteration. This algorithm has three advantages. First, PCFR can be combined with any CFR variant. The resulting Pure MCCFR (PMCCFR) can significantly reduce the time and space complexity of one iteration. Second, our experiments show that PMCCFR converges 2$\sim$3 times as fast as MCCFR. Finally, there is a type of game that is very suitable for PCFR; we call it a clear-game, characterized by a high proportion of dominated strategies. Experiments show that in clear-games, the convergence rate of PMCCFR is two orders of magnitude higher than that of MCCFR.
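The contrast between the two strategy-update rules named in this abstract can be sketched for a single information set. This is an illustration only: it assumes that "best response" here means placing all probability on the action with the largest cumulative counterfactual regret, which the abstract does not spell out, and the function names are hypothetical.

```python
def regret_matching(cumulative_regrets):
    """Standard regret matching: play actions in proportion to positive regret."""
    positive = [max(r, 0.0) for r in cumulative_regrets]
    total = sum(positive)
    if total > 0:
        return [p / total for p in positive]
    # No positive regret: fall back to the uniform strategy.
    return [1.0 / len(cumulative_regrets)] * len(cumulative_regrets)

def pure_strategy(cumulative_regrets):
    """Assumed PCFR-style update: a pure strategy on the highest-regret action."""
    best = max(range(len(cumulative_regrets)), key=cumulative_regrets.__getitem__)
    return [1.0 if i == best else 0.0 for i in range(len(cumulative_regrets))]

regrets = [1.0, 3.0, -2.0]
mixed = regret_matching(regrets)   # a mixed strategy over the first two actions
pure = pure_strategy(regrets)      # all probability on the second action
```

A pure strategy needs only one action index per information set rather than a full probability vector, which is one plausible reading of the reduced per-iteration cost the abstract reports.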
    A Carbon Tracking Model for Federated Learning: Impact of Quantization and Sparsification. (arXiv:2310.08087v1 [eess.SP])
    Federated Learning (FL) methods adopt efficient communication technologies to distribute machine learning tasks across edge devices, reducing the overhead in terms of data storage and computational complexity compared to centralized solutions. Rather than moving large data volumes from producers (sensors, machines) to energy-hungry data centers, raising environmental concerns due to resource demands, FL provides an alternative solution to mitigate the energy demands of several learning tasks while enabling new Artificial Intelligence of Things (AIoT) applications. This paper proposes a framework for real-time monitoring of the energy and carbon footprint impacts of FL systems. The carbon tracking tool is evaluated for consensus (fully decentralized) and classical FL policies. For the first time, we present a quantitative evaluation of different computationally and communication efficient FL methods from the perspectives of energy consumption and carbon equivalent emissions, suggesting also general guidelines for energy-efficient design. Results indicate that consensus-driven FL implementations should be preferred for limiting carbon emissions when the energy efficiency of the communication is low (i.e., < 25 Kbit/Joule). Besides, quantization and sparsification operations are shown to strike a balance between learning performances and energy consumption, leading to sustainable FL designs.
    Deep Reinforcement Learning for Autonomous Cyber Operations: A Survey. (arXiv:2310.07745v1 [cs.LG])
    The rapid increase in the number of cyber-attacks in recent years raises the need for principled methods for defending networks against malicious actors. Deep reinforcement learning (DRL) has emerged as a promising approach for mitigating these attacks. However, while DRL has shown much potential for cyber-defence, numerous challenges must be overcome before DRL can be applied to autonomous cyber-operations (ACO) at scale. Principled methods are required for environments that confront learners with very high-dimensional state spaces, large multi-discrete action spaces, and adversarial learning. Recent works have reported success in solving these problems individually. There have also been impressive engineering efforts towards solving all three for real-time strategy games. However, applying DRL to the full ACO problem remains an open challenge. Here, we survey the relevant DRL literature and conceptualize an idealised ACO-DRL agent. We provide: i.) A summary of the domain properties that define the ACO problem; ii.) A comprehensive evaluation of the extent to which domains used for benchmarking DRL approaches are comparable to ACO; iii.) An overview of state-of-the-art approaches for scaling DRL to domains that confront learners with the curse of dimensionality, and; iv.) A survey and critique of current methods for limiting the exploitability of agents within adversarial settings from the perspective of ACO. We conclude with open research questions that we hope will motivate future directions for researchers and practitioners working on ACO.
    Towards the Fundamental Limits of Knowledge Transfer over Finite Domains. (arXiv:2310.07838v1 [cs.LG])
    We characterize the statistical efficiency of knowledge transfer through $n$ samples from a teacher to a probabilistic student classifier with input space $\mathcal S$ over labels $\mathcal A$. We show that privileged information at three progressive levels accelerates the transfer. At the first level, only samples with hard labels are known, via which the maximum likelihood estimator attains the minimax rate $\sqrt{{|{\mathcal S}||{\mathcal A}|}/{n}}$. The second level has the teacher probabilities of sampled labels available in addition, which turns out to boost the convergence rate lower bound to ${{|{\mathcal S}||{\mathcal A}|}/{n}}$. However, under this second data acquisition protocol, minimizing a naive adaptation of the cross-entropy loss results in an asymptotically biased student. We overcome this limitation and achieve the fundamental limit by using a novel empirical variant of the squared error logit loss. The third level further equips the student with the soft labels (complete logits) on ${\mathcal A}$ given every sampled input, thereby provably enabling the student to enjoy a rate ${|{\mathcal S}|}/{n}$ free of $|{\mathcal A}|$. We find any Kullback-Leibler divergence minimizer to be optimal in the last case. Numerical simulations distinguish the four learners and corroborate our theory.
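To get a feel for how far apart the three rates are, the expressions from the abstract can be evaluated for concrete sizes. The sketch below ignores constants and logarithmic factors, and the function name and example sizes are chosen purely for illustration.

```python
import math

def transfer_rates(num_inputs, num_labels, num_samples):
    """Evaluate the three rate expressions, up to constants and log factors."""
    sa = num_inputs * num_labels
    return {
        # Level 1: hard labels only, sqrt(|S||A| / n)
        "hard_labels": math.sqrt(sa / num_samples),
        # Level 2: teacher probabilities of the sampled labels, |S||A| / n
        "sampled_label_probabilities": sa / num_samples,
        # Level 3: full soft labels for every sampled input, |S| / n
        "soft_labels": num_inputs / num_samples,
    }

# Hypothetical sizes: |S| = 100 inputs, |A| = 10 labels, n = 1,000,000 samples.
rates = transfer_rates(100, 10, 1_000_000)
```

With these sizes the three expressions come out at roughly 3.2e-2, 1e-3 and 1e-4, so each extra level of privileged information lowers the rate by about an order of magnitude or more in this example.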
    Discovering Hierarchical Achievements in Reinforcement Learning via Contrastive Learning. (arXiv:2307.03486v2 [cs.LG] UPDATED)
    Discovering achievements with a hierarchical structure in procedurally generated environments presents a significant challenge. This requires an agent to possess a broad range of abilities, including generalization and long-term reasoning. Many prior methods have been built upon model-based or hierarchical approaches, with the belief that an explicit module for long-term planning would be advantageous for learning hierarchical dependencies. However, these methods demand an excessive number of environment interactions or large model sizes, limiting their practicality. In this work, we demonstrate that proximal policy optimization (PPO), a simple yet versatile model-free algorithm, outperforms previous methods when optimized with recent implementation practices. Moreover, we find that the PPO agent can predict the next achievement to be unlocked to some extent, albeit with limited confidence. Based on this observation, we introduce a novel contrastive learning method, called achievement distillation, which strengthens the agent's ability to predict the next achievement. Our method exhibits a strong capacity for discovering hierarchical achievements and shows state-of-the-art performance on the challenging Crafter environment in a sample-efficient manner while utilizing fewer model parameters.
    Conformal inference for regression on Riemannian Manifolds. (arXiv:2310.08209v1 [stat.ML])
    Regression on manifolds, and, more broadly, statistics on manifolds, has garnered significant importance in recent years due to the vast number of applications for this type of data. Circular data is a classic example, but so is data in the space of covariance matrices, data on the Grassmannian manifold obtained as a result of principal component analysis, among many others. In this work we investigate prediction sets for regression scenarios when the response variable, denoted by $Y$, resides in a manifold, and the covariable, denoted by $X$, lies in Euclidean space. This extends the concepts delineated in [Lei and Wasserman, 2014] to this novel context. Aligning with traditional principles in conformal inference, these prediction sets are distribution-free, indicating that no specific assumptions are imposed on the joint distribution of $(X, Y)$, and they maintain a non-parametric character. We prove the asymptotic almost sure convergence of the empirical version of these regions on the manifold to their population counterparts. The efficiency of this method is shown through a comprehensive simulation study and an analysis involving real-world data.
    Beyond Uniform Sampling: Offline Reinforcement Learning with Imbalanced Datasets. (arXiv:2310.04413v2 [cs.LG] UPDATED)
    Offline policy learning is aimed at learning decision-making policies using existing datasets of trajectories without collecting additional data. The primary motivation for using reinforcement learning (RL) instead of supervised learning techniques such as behavior cloning is to find a policy that achieves a higher average return than the trajectories constituting the dataset. However, we empirically find that when a dataset is dominated by suboptimal trajectories, state-of-the-art offline RL algorithms do not substantially improve over the average return of trajectories in the dataset. We argue this is due to an assumption made by current offline RL algorithms of staying close to the trajectories in the dataset. If the dataset primarily consists of suboptimal trajectories, this assumption forces the policy to mimic the suboptimal actions. We overcome this issue by proposing a sampling strategy that enables the policy to only be constrained to "good data" rather than all actions in the dataset (i.e., uniform sampling). We present a realization of the sampling strategy and an algorithm that can be used as a plug-and-play module in standard offline RL algorithms. Our evaluation demonstrates significant performance gains on 72 imbalanced datasets, the D4RL dataset, and across three different offline RL algorithms. Code is available at https://github.com/Improbable-AI/dw-offline-rl.
    Provably Efficient Offline Goal-Conditioned Reinforcement Learning with General Function Approximation and Single-Policy Concentrability. (arXiv:2302.03770v2 [cs.LG] UPDATED)
    Goal-conditioned reinforcement learning (GCRL) refers to learning general-purpose skills that aim to reach diverse goals. In particular, offline GCRL only requires purely pre-collected datasets to perform training tasks without additional interactions with the environment. Although offline GCRL has become increasingly prevalent and many previous works have demonstrated its empirical success, the theoretical understanding of efficient offline GCRL algorithms is not well established, especially when the state space is huge and the offline dataset only covers the policy we aim to learn. In this paper, we provide a rigorous theoretical analysis of an existing empirically successful offline GCRL algorithm. We prove that under slight modification, this algorithm enjoys an $\widetilde{O}(\text{poly}(1/\epsilon))$ sample complexity (where $\epsilon$ is the desired suboptimality of the learned policy) with general function approximation thanks to the property of (semi-)strong convexity of the objective functions. We only require nearly minimal assumptions on the dataset (single-policy concentrability) and the function class (realizability). Moreover, this algorithm consists of two uninterleaved optimization steps, which we refer to as $V$-learning and policy learning, and is computationally stable since it does not involve minimax optimization. We also empirically validate our theory by showing that the modified algorithm outperforms the previous algorithm in various real-world environments. To the best of our knowledge, this is the first algorithm that is both provably efficient with general function approximation and single-policy concentrability, and empirically successful without requiring solving minimax optimization problems.
    Analyzing And Editing Inner Mechanisms Of Backdoored Language Models. (arXiv:2302.12461v2 [cs.LG] UPDATED)
    Poisoning of data sets is a potential security threat to large language models that can lead to backdoored models. A description of the internal mechanisms of backdoored language models and how they process trigger inputs, e.g., when switching to toxic language, has yet to be found. In this work, we study the internal representations of transformer-based backdoored language models and determine early-layer MLP modules as most important for the backdoor mechanism in combination with the initial embedding projection. We use this knowledge to remove, insert, and modify backdoor mechanisms with engineered replacements that reduce the MLP module outputs to essentials for the backdoor mechanism. To this end, we introduce PCP ablation, where we replace transformer modules with low-rank matrices based on the principal components of their activations. We demonstrate our results on backdoored toy, backdoored large, and non-backdoored open-source models. We show that we can improve the backdoor robustness of large language models by locally constraining individual modules during fine-tuning on potentially poisonous data sets. Trigger warning: Offensive language.
    Quantum-Enhanced Forecasting: Leveraging Quantum Gramian Angular Field and CNNs for Stock Return Predictions. (arXiv:2310.07427v2 [cs.LG] UPDATED)
    We propose a time series forecasting method named Quantum Gramian Angular Field (QGAF). This approach merges the advantages of quantum computing technology with deep learning, aiming to enhance the precision of time series classification and forecasting. We successfully transformed stock return time series data into two-dimensional images suitable for Convolutional Neural Network (CNN) training by designing specific quantum circuits. Distinct from the classical Gramian Angular Field (GAF) approach, QGAF's uniqueness lies in eliminating the need for data normalization and inverse cosine calculations, simplifying the transformation process from time series data to two-dimensional images. To validate the effectiveness of this method, we conducted experiments on datasets from three major stock markets: the China A-share market, the Hong Kong stock market, and the US stock market. Experimental results revealed that compared to the classical GAF method, the QGAF approach significantly improved time series prediction accuracy, reducing prediction errors by an average of 25% for Mean Absolute Error (MAE) and 48% for Mean Squared Error (MSE). This research confirms the potential and promising prospects of integrating quantum computing with deep learning techniques in financial time series forecasting.
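For context, the classical GAF transformation that this abstract contrasts with can be sketched as follows. The sketch shows the summation variant of the classical method, not the quantum-circuit version, and the function name is hypothetical. It makes visible the normalization and inverse cosine steps that QGAF is said to avoid.

```python
import math

def gramian_angular_field(series):
    """Classical Gramian Angular (summation) Field of a 1-D series.

    Returns a len(series) x len(series) nested list with entries
    cos(phi_i + phi_j), where phi_i = arccos of the rescaled value.
    """
    lo, hi = min(series), max(series)
    if hi == lo:
        raise ValueError("series must contain at least two distinct values")
    # Step 1: normalize to [-1, 1] so that arccos is defined.
    scaled = [2 * (x - lo) / (hi - lo) - 1 for x in series]
    # Step 2: inverse cosine gives the angular encoding (clamped for safety).
    phi = [math.acos(max(-1.0, min(1.0, s))) for s in scaled]
    # Step 3: the image is the pairwise cosine of summed angles.
    return [[math.cos(a + b) for b in phi] for a in phi]

image = gramian_angular_field([0.0, 1.0, 2.0])
```

The result is a symmetric two-dimensional array the same width as the input series, which is what makes it usable as a single-channel image for a CNN.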
    Contextualized Policy Recovery: Modeling and Interpreting Medical Decisions with Adaptive Imitation Learning. (arXiv:2310.07918v1 [cs.LG])
    Interpretable policy learning seeks to estimate intelligible decision policies from observed actions; however, existing models fall short by forcing a tradeoff between accuracy and interpretability. This tradeoff limits data-driven interpretations of human decision-making processes. For example, to audit medical decisions for biases and suboptimal practices, we require models of decision processes which provide concise descriptions of complex behaviors. Fundamentally, existing approaches are burdened by this tradeoff because they represent the underlying decision process as a universal policy, when in fact human decisions are dynamic and can change drastically with contextual information. Thus, we propose Contextualized Policy Recovery (CPR), which re-frames the problem of modeling complex decision processes as a multi-task learning problem in which complex decision policies are composed of context-specific policies. CPR models each context-specific policy as a linear observation-to-action mapping, and generates new decision models $\textit{on-demand}$ as contexts are updated with new observations. CPR is compatible with fully offline and partially observable decision environments, and can be tailored to incorporate any recurrent black-box model or interpretable decision model. We assess CPR through studies on simulated and real data, achieving state-of-the-art performance on the canonical tasks of predicting antibiotic prescription in intensive care units ($+22\%$ AUROC vs. previous SOTA) and predicting MRI prescription for Alzheimer's patients ($+7.7\%$ AUROC vs. previous SOTA). With this improvement in predictive performance, CPR closes the accuracy gap between interpretable and black-box methods for policy learning, allowing high-resolution exploration and analysis of context-specific decision models.
    NuTime: Numerically Multi-Scaled Embedding for Large-Scale Time Series Pretraining. (arXiv:2310.07402v2 [cs.LG] UPDATED)
    Recent research on time-series self-supervised models shows great promise in learning semantic representations. However, it has been limited to small-scale datasets, e.g., thousands of temporal sequences. In this work, we make key technical contributions that are tailored to the numerical properties of time-series data and allow the model to scale to large datasets, e.g., millions of temporal sequences. We adopt the Transformer architecture by first partitioning the input into non-overlapping windows. Each window is then characterized by its normalized shape and two scalar values denoting the mean and standard deviation within each window. To embed scalar values that may possess arbitrary numerical scales to high-dimensional vectors, we propose a numerically multi-scaled embedding module enumerating all possible scales for the scalar values. The model undergoes pretraining using the proposed numerically multi-scaled embedding with a simple contrastive objective on a large-scale dataset containing over a million sequences. We study its transfer performance on a number of univariate and multivariate classification benchmarks. Our method exhibits remarkable improvement against previous representation learning approaches and establishes the new state of the art, even compared with domain-specific non-learning-based methods.
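The window decomposition described in this abstract (normalized shape plus two scalars per window) can be sketched as below. This is only an illustration of the preprocessing idea: the use of the population standard deviation, the dropping of a trailing partial window, and the function name are assumptions, and the multi-scaled embedding of the scalars is not shown.

```python
import statistics

def window_features(series, window):
    """Split a series into non-overlapping windows of length `window`.

    Each window yields (normalized_shape, mean, std). A trailing partial
    window is dropped (an assumption made for this sketch).
    """
    features = []
    for start in range(0, len(series) - window + 1, window):
        chunk = series[start:start + window]
        mean = statistics.fmean(chunk)
        std = statistics.pstdev(chunk)          # population standard deviation
        scale = std if std > 0 else 1.0         # avoid dividing by zero on flat windows
        shape = [(x - mean) / scale for x in chunk]
        features.append((shape, mean, std))
    return features

feats = window_features([1, 2, 3, 4, 5, 6, 7, 8, 9], window=4)
```

Separating the scale-free shape from the mean and standard deviation means the two scalars carry all of the magnitude information, which is why they need an embedding that copes with arbitrary numerical scales.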
    MMTSA: Multimodal Temporal Segment Attention Network for Efficient Human Activity Recognition. (arXiv:2210.09222v2 [cs.CV] UPDATED)
    Multimodal sensors provide complementary information to develop accurate machine-learning methods for human activity recognition (HAR), but introduce significantly higher computational load, which reduces efficiency. This paper proposes an efficient multimodal neural architecture for HAR using an RGB camera and inertial measurement units (IMUs) called Multimodal Temporal Segment Attention Network (MMTSA). MMTSA first transforms IMU sensor data into a temporal and structure-preserving gray-scale image using the Gramian Angular Field (GAF), representing the inherent properties of human activities. MMTSA then applies a multimodal sparse sampling method to reduce data redundancy. Lastly, MMTSA adopts an inter-segment attention module for efficient multimodal fusion. Using three well-established public datasets, we evaluated MMTSA's effectiveness and efficiency in HAR. Results show that our method improves the cross-subject F1-score on the MMAct dataset by 11.13% over the previous state-of-the-art (SOTA) methods. The ablation study and analysis demonstrate MMTSA's effectiveness in fusing multimodal data for accurate HAR. The efficiency evaluation on an edge device showed that MMTSA achieved significantly better accuracy, lower computational load, and lower inference latency than SOTA methods.
    A Comprehensive Review on Tree Detection Methods Using Point Cloud and Aerial Imagery from Unmanned Aerial Vehicles. (arXiv:2309.16375v2 [cs.CV] CROSS LISTED)
    Unmanned Aerial Vehicles (UAVs) are considered cutting-edge technology with highly cost-effective and flexible usage scenarios. Although many papers have reviewed the application of UAVs in agriculture, the review of the application for tree detection is still insufficient. This paper focuses on tree detection methods applied to data collected by UAVs. There are two kinds of data, the point cloud and the images, which are acquired by the Light Detection and Ranging (LiDAR) sensor and camera, respectively. Among the detection methods using point-cloud data, this paper mainly classifies these methods according to LiDAR and Digital Aerial Photography (DAP). For the detection methods using images directly, this paper reviews these methods by whether or not they use Deep Learning (DL). Our review summarizes and analyzes the comparison and combination of LiDAR-based and DAP-based point cloud data. The performance, relative merits, and application fields of the methods are also introduced. Meanwhile, this review counts the number of tree detection studies using different methods in recent years. From our statistics, detection using DL methods on images has become a mainstream trend, as the share of DL-based detection studies has risen to 45% of all tree detection studies up to 2022. As a result, this review could help and guide researchers who want to carry out tree detection on specific forests and farmers who use UAVs in managing agricultural production.
    Core-sets for Fair and Diverse Data Summarization. (arXiv:2310.08122v1 [cs.DS])
    We study core-set construction algorithms for the task of Diversity Maximization under fairness/partition constraints. Given a set of points $P$ in a metric space partitioned into $m$ groups, and given $k_1,\ldots,k_m$, the goal of this problem is to pick $k_i$ points from each group $i$ such that the overall diversity of the $k=\sum_i k_i$ picked points is maximized. We consider two natural diversity measures: sum-of-pairwise distances and sum-of-nearest-neighbor distances, and show improved core-set construction algorithms with respect to these measures. More precisely, we show the first constant factor core-set w.r.t. sum-of-pairwise distances whose size is independent of the size of the dataset and the aspect ratio. Second, we show the first core-set w.r.t. the sum-of-nearest-neighbor distances. Finally, we run several experiments showing the effectiveness of our core-set approach. In particular, we apply constrained diversity maximization to summarize a set of timed messages that takes into account the messages' recency. Specifically, the summary should include more recent messages compared to older ones. This is a real task in one of the largest communication platforms, affecting the experience of hundreds of millions of daily active users. By utilizing our core-set method for this task, we achieve a 100x speed-up while losing only a few percent of the diversity. Moreover, our approach allows us to improve the space usage of the algorithm in the streaming setting.
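The two diversity measures named in this abstract can be made concrete with a short sketch. It only evaluates the objectives for an already chosen set of points in Euclidean space; it is not the core-set construction. The function names are hypothetical, and the reading of the nearest-neighbor measure as a sum over each selected point's closest other selected point is an assumption.

```python
import math
from itertools import combinations

def sum_pairwise(points):
    """Sum of distances over all unordered pairs of selected points."""
    return sum(math.dist(p, q) for p, q in combinations(points, 2))

def sum_nearest_neighbor(points):
    """Sum, over each selected point, of the distance to its nearest other point.

    Requires at least two points.
    """
    total = 0.0
    for i, p in enumerate(points):
        total += min(math.dist(p, q) for j, q in enumerate(points) if j != i)
    return total

selected = [(0, 0), (3, 4), (6, 8)]
pairwise = sum_pairwise(selected)          # 5 + 10 + 5
nearest = sum_nearest_neighbor(selected)   # 5 + 5 + 5
```

The pairwise measure rewards a few far-apart points, while the nearest-neighbor measure penalizes any selected point that sits close to another one, so the two can prefer different summaries of the same data.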
    MetaBox: A Benchmark Platform for Meta-Black-Box Optimization with Reinforcement Learning. (arXiv:2310.08252v1 [cs.LG])
    Recently, Meta-Black-Box Optimization with Reinforcement Learning (MetaBBO-RL) has showcased the power of leveraging RL at the meta-level to mitigate manual fine-tuning of low-level black-box optimizers. However, this field is hindered by the lack of a unified benchmark. To fill this gap, we introduce MetaBox, the first benchmark platform expressly tailored for developing and evaluating MetaBBO-RL methods. MetaBox offers a flexible algorithmic template that allows users to effortlessly implement their unique designs within the platform. Moreover, it provides a broad spectrum of over 300 problem instances, collected from synthetic to realistic scenarios, and an extensive library of 19 baseline methods, including both traditional black-box optimizers and recent MetaBBO-RL methods. Besides, MetaBox introduces three standardized performance metrics, enabling a more thorough assessment of the methods. In a bid to illustrate the utility of MetaBox for facilitating rigorous evaluation and in-depth analysis, we carry out a wide-ranging benchmarking study on existing MetaBBO-RL methods. Our MetaBox is open-source and accessible at: https://github.com/GMC-DRL/MetaBox.
    Generative modeling of time-dependent densities via optimal transport and projection pursuit. (arXiv:2304.09663v2 [stat.ML] UPDATED)
    Motivated by the computational difficulties incurred by popular deep learning algorithms for the generative modeling of temporal densities, we propose a cheap alternative which requires minimal hyperparameter tuning and scales favorably to high dimensional problems. In particular, we use a projection-based optimal transport solver [Meng et al., 2019] to join successive samples and subsequently use transport splines [Chewi et al., 2020] to interpolate the evolving density. When the sampling frequency is sufficiently high, the optimal maps are close to the identity and are thus computationally efficient to compute. Moreover, the training process is highly parallelizable as all optimal maps are independent and can thus be learned simultaneously. Finally, the approach is based solely on numerical linear algebra rather than minimizing a nonconvex objective function, allowing us to easily analyze and control the algorithm. We present several numerical experiments on both synthetic and real-world datasets to demonstrate the efficiency of our method. In particular, these experiments show that the proposed approach is highly competitive compared with state-of-the-art normalizing flows conditioned on time across a wide range of dimensionalities.
    An interpretable neural network-based non-proportional odds model for ordinal regression. (arXiv:2303.17823v3 [stat.ME] UPDATED)
    This study proposes an interpretable neural network-based non-proportional odds model (N$^3$POM) for ordinal regression. N$^3$POM is different from conventional approaches to ordinal regression with non-proportional models in several ways: (1) N$^3$POM is designed to directly handle continuous responses, whereas standard methods typically treat de facto ordered continuous variables as discrete, (2) instead of estimating response-dependent finite coefficients of linear models from discrete responses as is done in conventional approaches, we train a non-linear neural network to serve as a coefficient function. Thanks to the neural network, N$^3$POM offers flexibility while preserving the interpretability of conventional ordinal regression. We establish a sufficient condition under which the predicted conditional cumulative probability locally satisfies the monotonicity constraint over a user-specified region in the covariate space. Additionally, we provide a monotonicity-preserving stochastic (MPS) algorithm for effectively training the neural network. We apply N$^3$POM to several real-world datasets.
    Bitrate-Constrained DRO: Beyond Worst Case Robustness To Unknown Group Shifts. (arXiv:2302.02931v2 [cs.LG] UPDATED)
    Training machine learning models robust to distribution shifts is critical for real-world applications. Some robust training algorithms (e.g., Group DRO) specialize to group shifts and require group information on all training points. Other methods (e.g., CVaR DRO) that do not need group annotations can be overly conservative, since they naively upweight high loss points which may form a contrived set that does not correspond to any meaningful group in the real world (e.g., when the high loss points are randomly mislabeled training points). In this work, we address limitations in prior approaches by assuming a more nuanced form of group shift: conditioned on the label, we assume that the true group function (indicator over group) is simple. For example, we may expect that group shifts occur along low bitrate features (e.g., image background, lighting). Thus, we aim to learn a model that maintains high accuracy on simple group functions realized by these low bitrate features, that need not spend valuable model capacity achieving high accuracy on contrived groups of examples. Based on this, we consider the two-player game formulation of DRO where the adversary's capacity is bitrate-constrained. Our resulting practical algorithm, Bitrate-Constrained DRO (BR-DRO), does not require group information on training samples yet matches the performance of Group DRO on datasets that have training group annotations and that of CVaR DRO on long-tailed distributions. Our theoretical analysis reveals that in some settings BR-DRO objective can provably yield statistically efficient and less conservative solutions than unconstrained CVaR DRO.
    Explainable Attention for Few-shot Learning and Beyond. (arXiv:2310.07800v1 [cs.AI])
    Attention mechanisms have exhibited promising potential in enhancing learning models by identifying salient portions of input data. This is particularly valuable in scenarios where limited training samples are accessible due to challenges in data collection and labeling. Drawing inspiration from human recognition processes, we posit that an AI baseline's performance could be more accurate and dependable if it is exposed to essential segments of raw data rather than the entire input dataset, akin to human perception. However, the task of selecting these informative data segments, referred to as hard attention finding, presents a formidable challenge. In situations with few training samples, existing studies struggle to locate such informative regions due to the large number of training parameters that cannot be effectively learned from the available limited samples. In this study, we introduce a novel and practical framework for achieving explainable hard attention finding, specifically tailored for few-shot learning scenarios, called FewXAT. Our approach employs deep reinforcement learning to implement the concept of hard attention, directly impacting raw input data and thus rendering the process interpretable for human understanding. Through extensive experimentation across various benchmark datasets, we demonstrate the efficacy of our proposed method.
    Reinforcement Learning of Display Transfer Robots in Glass Flow Control Systems: A Physical Simulation-Based Approach. (arXiv:2310.07981v1 [cs.LG])
    A flow control system is a critical concept for increasing the production capacity of manufacturing systems. To solve the scheduling optimization problem related to flow control with the aim of improving productivity, existing methods depend on heuristic designs by human domain experts. Therefore, the methods require correction, monitoring, and verification by using real equipment. As system designs increase in complexity, the monitoring time increases, which decreases the probability of arriving at the optimal design. As an alternative approach to the heuristic design of flow control systems, the use of deep reinforcement learning to solve the scheduling optimization problem has been considered. Although the existing research on reinforcement learning has yielded excellent performance in some areas, the applicability of the results to actual FABs, such as display and semiconductor manufacturing processes, is not yet evident. To this end, we propose a method to implement a physical simulation environment and devise a feasible flow control system design using a transfer robot in display manufacturing through reinforcement learning. We present a model and parameter setting to build a virtual environment for different display transfer robots, and training methods of reinforcement learning on the environment to obtain an optimal scheduling of glass flow control systems. Its feasibility was verified by using different types of robots used in the actual process.
    Identifying latent distances with Finslerian geometry. (arXiv:2212.10010v2 [cs.LG] UPDATED)
    Riemannian geometry provides us with powerful tools to explore the latent space of generative models while preserving the underlying structure of the data. The latent space can be equipped with a Riemannian metric, pulled back from the data manifold. With this metric, we can systematically navigate the space relying on geodesics defined as the shortest curves between two points. Generative models are often stochastic, causing the data space, the Riemannian metric, and the geodesics to be stochastic as well. Stochastic objects are at best impractical, and at worst impossible, to manipulate. A common solution is to approximate the stochastic pullback metric by its expectation. But the geodesics derived from this expected Riemannian metric do not correspond to the expected length-minimising curves. In this work, we propose another metric whose geodesics explicitly minimise the expected length of the pullback metric. We show this metric defines a Finsler metric, and we compare it with the expected Riemannian metric. In high dimensions, we prove that both metrics converge to each other at a rate of $O\left(\frac{1}{D}\right)$. This convergence implies that the established expected Riemannian metric is an accurate approximation of the theoretically more grounded Finsler metric. This provides justification for using the expected Riemannian metric for practical implementations.
    Theoretical Hardness and Tractability of POMDPs in RL with Partial Online State Information. (arXiv:2306.08762v2 [cs.LG] UPDATED)
    Partially observable Markov decision processes (POMDPs) have been widely applied to capture many real-world applications. However, existing theoretical results have shown that learning in general POMDPs could be intractable, where the main challenge lies in the lack of latent state information. A key fundamental question here is how much online state information (OSI) is sufficient to achieve tractability. In this paper, we establish a lower bound that reveals a surprising hardness result: unless we have full OSI, we need an exponentially scaling sample complexity to obtain an $\epsilon$-optimal policy solution for POMDPs. Nonetheless, inspired by the key insights in our lower bound design, we find that there exist important tractable classes of POMDPs even with only partial OSI. In particular, for two novel classes of POMDPs with partial OSI, we provide new algorithms that are proved to be near-optimal by establishing new regret upper and lower bounds.
    Infinite Width Graph Neural Networks for Node Regression/ Classification. (arXiv:2310.08176v1 [cs.LG])
    This work analyzes Graph Neural Networks, a generalization of Fully-Connected Deep Neural Nets on Graph structured data, when their width, that is, the number of nodes in each fully-connected layer, increases to infinity. Infinite Width Neural Networks connect Deep Learning to Gaussian Processes and Kernels, both Machine Learning Frameworks with long traditions and extensive theoretical foundations. Gaussian Processes and Kernels have far fewer hyperparameters than Neural Networks and can be used for uncertainty estimation, making them more user-friendly for applications. This work extends the growing body of research connecting Gaussian Processes and Kernels to Neural Networks. The Kernel and Gaussian Process closed forms are derived for a variety of architectures, namely the standard Graph Neural Network, the Graph Neural Network with Skip-Concatenate Connections and the Graph Attention Neural Network. All architectures are evaluated on a variety of datasets on the task of transductive Node Regression and Classification. Additionally, a Spectral Sparsification method known as Effective Resistance is used to improve runtime and memory requirements. Extending the setting to inductive graph learning tasks (Graph Regression/ Classification) is straightforward and is briefly discussed in Section 3.5.
    Tight Time-Space Lower Bounds for Constant-Pass Learning. (arXiv:2310.08070v1 [cs.LG])
    In his breakthrough paper, Raz showed that any parity learning algorithm requires either quadratic memory or an exponential number of samples [FOCS'16, JACM'19]. A line of work that followed extended this result to a large class of learning problems. Until recently, all these results considered learning in the streaming model, where each sample is drawn independently, and the learner is allowed a single pass over the stream of samples. Garg, Raz, and Tal [CCC'19] considered a stronger model, allowing multiple passes over the stream. In the $2$-pass model, they showed that learning parities of size $n$ requires either a memory of size $n^{1.5}$ or at least $2^{\sqrt{n}}$ samples. (Their result also generalizes to other learning problems.) In this work, for any constant $q$, we prove tight memory-sample lower bounds for any parity learning algorithm that makes $q$ passes over the stream of samples. We show that such a learner requires either $\Omega(n^{2})$ memory size or at least $2^{\Omega(n)}$ samples. Beyond establishing a tight lower bound, this is the first non-trivial lower bound for $q$-pass learning for any $q\ge 3$. Similar to prior work, our results extend to any learning problem with many nearly-orthogonal concepts. We complement the lower bound with an upper bound, showing that parity learning with $q$ passes can be done efficiently with $O(n^2/\log q)$ memory.
    AutoFHE: Automated Adaption of CNNs for Efficient Evaluation over FHE. (arXiv:2310.08012v1 [cs.LG])
    Secure inference of deep convolutional neural networks (CNNs) under RNS-CKKS involves polynomial approximation of unsupported non-linear activation functions. However, existing approaches have three main limitations: 1) Inflexibility: The polynomial approximation and associated homomorphic evaluation architecture are customized manually for each CNN architecture and do not generalize to other networks. 2) Suboptimal Approximation: Each activation function is approximated instead of the function represented by the CNN. 3) Restricted Design: Either high-degree or low-degree polynomial approximations are used. The former retains high accuracy but slows down inference due to bootstrapping operations, while the latter accelerates ciphertext inference but compromises accuracy. To address these limitations, we present AutoFHE, which automatically adapts standard CNNs for secure inference under RNS-CKKS. The key idea is to adopt layerwise mixed-degree polynomial activation functions, which are optimized jointly with the homomorphic evaluation architecture in terms of the placement of bootstrapping operations. The problem is modeled within a multi-objective optimization framework to maximize accuracy and minimize the number of bootstrapping operations. AutoFHE can be applied flexibly on any CNN architecture, and it provides diverse solutions that span the trade-off between accuracy and latency. Experimental evaluation over RNS-CKKS encrypted CIFAR datasets shows that AutoFHE accelerates secure inference by $1.32\times$ to $1.8\times$ compared to methods employing high-degree polynomials. It also improves accuracy by up to 2.56% compared to methods using low-degree polynomials. Lastly, AutoFHE accelerates inference and improves accuracy by $103\times$ and 3.46%, respectively, compared to CNNs under TFHE.
    Impact of multi-armed bandit strategies on deep recurrent reinforcement learning. (arXiv:2310.08331v1 [stat.ML])
    Incomplete knowledge of the environment leads an agent to make decisions under uncertainty. One of the major dilemmas in Reinforcement Learning (RL) is that an autonomous agent has to balance two contrasting needs in making its decisions: exploiting the current knowledge of the environment to maximize the cumulative reward, and exploring actions that improve the knowledge of the environment, hopefully leading to higher reward values (the exploration-exploitation trade-off). Concurrently, another relevant issue regards the full observability of the states, which may not be assumed in all applications, such as when only 2D images are considered as input in an RL approach used for finding the optimal action within a 3D simulation environment. In this work, we address these issues by deploying and testing several techniques to balance the exploration-exploitation trade-off on partially observable systems for predicting steering wheels in an autonomous driving scenario. More precisely, the final aim is to investigate the effects of using both stochastic and deterministic multi-armed bandit strategies coupled with a Deep Recurrent Q-Network. Additionally, we adapted and evaluated the impact of an innovative method to improve the learning phase of the underlying Convolutional Recurrent Neural Network. We aim to show that adaptive stochastic methods for exploration better approximate the trade-off between exploration and exploitation as, in general, Softmax and Max-Boltzmann strategies are able to outperform epsilon-greedy techniques.
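    The exploration strategies named in the abstract above can be sketched as follows. This is a minimal illustration over a plain list of action values, assuming the commonly used textbook definitions; the function names, the temperature parameter, and the Max-Boltzmann variant shown here (greedy with probability 1 - epsilon, Boltzmann sampling otherwise) are assumptions and may differ from the paper's implementation.

```python
import math
import random


def greedy(q_values):
    """Index of the action with the highest estimated value."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon pick a uniformly random action, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return greedy(q_values)


def softmax_probs(q_values, temperature):
    """Boltzmann distribution over actions; subtracting the max avoids overflow."""
    m = max(q_values)
    exps = [math.exp((q - m) / temperature) for q in q_values]
    z = sum(exps)
    return [e / z for e in exps]


def softmax_action(q_values, temperature, rng):
    """Sample an action in proportion to its Boltzmann probability."""
    probs = softmax_probs(q_values, temperature)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]


def max_boltzmann(q_values, epsilon, temperature, rng):
    """Greedy with probability 1 - epsilon, otherwise a Boltzmann sample."""
    if rng.random() < epsilon:
        return softmax_action(q_values, temperature, rng)
    return greedy(q_values)


rng = random.Random(0)               # fixed seed so runs are repeatable
q = [1.0, 5.0, 2.0]                  # illustrative action-value estimates
probs = softmax_probs(q, temperature=1.0)
```

    Unlike epsilon-greedy, the Boltzmann-based strategies explore in proportion to the estimated values, so clearly poor actions are tried less often than near-optimal ones.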
    Generative Intrinsic Optimization: Intrinsic Control with Model Learning. (arXiv:2310.08100v1 [cs.LG])
    Future sequence represents the outcome after executing the action into the environment. When driven by the information-theoretic concept of mutual information, it seeks maximally informative consequences. Explicit outcomes may vary across state, return, or trajectory serving different purposes such as credit assignment or imitation learning. However, the inherent nature of incorporating intrinsic motivation with reward maximization is often neglected. In this work, we propose a variational approach to jointly learn the necessary quantity for estimating the mutual information and the dynamics model, providing a general framework for incorporating different forms of outcomes of interest. Integrated into a policy iteration scheme, our approach guarantees convergence to the optimal policy. While we mainly focus on theoretical analysis, our approach opens the possibilities of leveraging intrinsic control with model learning to enhance sample efficiency and incorporate uncertainty of the environment into decision-making.
    Efficient Integrators for Diffusion Generative Models. (arXiv:2310.07894v1 [cs.LG])
    Diffusion models suffer from slow sample generation at inference time. Therefore, developing a principled framework for fast deterministic/stochastic sampling for a broader class of diffusion models is a promising direction. We propose two complementary frameworks for accelerating sample generation in pre-trained models: Conjugate Integrators and Splitting Integrators. Conjugate integrators generalize DDIM, mapping the reverse diffusion dynamics to a more amenable space for sampling. In contrast, splitting-based integrators, commonly used in molecular dynamics, reduce the numerical simulation error by cleverly alternating between numerical updates involving the data and auxiliary variables. After extensively studying these methods empirically and theoretically, we present a hybrid method that leads to the best-reported performance for diffusion models in augmented spaces. Applied to Phase Space Langevin Diffusion [Pandey & Mandt, 2023] on CIFAR-10, our deterministic and stochastic samplers achieve FID scores of 2.11 and 2.36 in only 100 network function evaluations (NFE) as compared to 2.57 and 2.63 for the best-performing baselines, respectively. Our code and model checkpoints will be made publicly available at https://github.com/mandt-lab/PSLD.
    The Thousand Faces of Explainable AI Along the Machine Learning Life Cycle: Industrial Reality and Current State of Research. (arXiv:2310.07882v1 [cs.LG])
    In this paper, we investigate the practical relevance of explainable artificial intelligence (XAI) with a special focus on the producing industries and relate it to the current state of academic XAI research. Our findings are based on an extensive series of interviews regarding the role and applicability of XAI along the Machine Learning (ML) lifecycle in current industrial practice and its expected relevance in the future. The interviews were conducted among a great variety of roles and key stakeholders from different industry sectors. On top of that, we outline the state of XAI research by providing a concise review of the relevant literature. This enables us to provide an encompassing overview covering the opinions of the surveyed persons as well as the current state of academic research. By comparing our interview results with the current research approaches we reveal several discrepancies. While a multitude of different XAI approaches exists, most of them are centered around the model evaluation phase and data scientists. Their versatile capabilities for other stages are currently either not sufficiently explored or not popular among practitioners. In line with existing work, our findings also confirm that more effort is needed to enable non-expert users to interpret and understand opaque AI models with existing methods and frameworks.
    Seeing-Eye Quadruped Navigation with Force Responsive Locomotion Control. (arXiv:2309.04370v2 [cs.RO] UPDATED)
    Seeing-eye robots are very useful tools for guiding visually impaired people, potentially producing a huge societal impact given the low availability and high cost of real guide dogs. Although a few seeing-eye robot systems have already been demonstrated, none considered external tugs from humans, which frequently occur in a real guide dog setting. In this paper, we simultaneously train a locomotion controller that is robust to external tugging forces via Reinforcement Learning (RL), and an external force estimator via supervised learning. The controller ensures stable walking, and the force estimator enables the robot to respond to the external forces from the human. These forces are used to guide the robot to the global goal, which is unknown to the robot, while the robot guides the human around nearby obstacles via a local planner. Experimental results in simulation and on hardware show that our controller is robust to external forces, and our seeing-eye system can accurately detect force direction. We demonstrate our full seeing-eye robot system on a real quadruped robot with a blindfolded human. The video can be seen at our project page: https://bu-air-lab.github.io/guide_dog/
    LLM4TS: Two-Stage Fine-Tuning for Time-Series Forecasting with Pre-Trained LLMs. (arXiv:2308.08469v3 [cs.LG] UPDATED)
    In this work, we leverage pre-trained Large Language Models (LLMs) to enhance time-series forecasting. Mirroring the growing interest in unifying models for Natural Language Processing and Computer Vision, we envision creating an analogous model for long-term time-series forecasting. Due to limited large-scale time-series data for building robust foundation models, our approach LLM4TS focuses on leveraging the strengths of pre-trained LLMs. By combining time-series patching with temporal encoding, we have enhanced the capability of LLMs to handle time-series data effectively. Inspired by the supervised fine-tuning in chatbot domains, we prioritize a two-stage fine-tuning process: first conducting supervised fine-tuning to orient the LLM towards time-series data, followed by task-specific downstream fine-tuning. Furthermore, to unlock the flexibility of pre-trained LLMs without extensive parameter adjustments, we adopt several Parameter-Efficient Fine-Tuning (PEFT) techniques. Drawing on these innovations, LLM4TS has yielded state-of-the-art results in long-term forecasting. Our model has also shown exceptional capabilities as both a robust representation learner and an effective few-shot learner, thanks to the knowledge transferred from the pre-trained LLM.
    Multi-Objective Optimization for Sparse Deep Neural Network Training. (arXiv:2308.12243v2 [cs.LG] UPDATED)
    Different conflicting optimization criteria arise naturally in various Deep Learning scenarios. These can address different main tasks (i.e., in the setting of Multi-Task Learning), but also main and secondary tasks such as loss minimization versus sparsity. The usual approach is a simple weighting of the criteria, which formally only works in the convex setting. In this paper, we present a Multi-Objective Optimization algorithm using a modified Weighted Chebyshev scalarization for training Deep Neural Networks (DNNs) with respect to several tasks. By employing this scalarization technique, the algorithm can identify all optimal solutions of the original problem while reducing its complexity to a sequence of single-objective problems. The simplified problems are then solved using an Augmented Lagrangian method, enabling the use of popular optimization techniques such as Adam and Stochastic Gradient Descent, while efficaciously handling constraints. Our work aims to address the (economical and also ecological) sustainability issue of DNN models, with a particular focus on Deep Multi-Task models, which are typically designed with a very large number of weights to perform equally well on multiple tasks. Through experiments conducted on two Machine Learning datasets, we demonstrate the possibility of adaptively sparsifying the model during training without significantly impacting its performance, if we are willing to apply task-specific adaptations to the network weights. Code is available at https://github.com/salomonhotegni/MDMTN.
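    The contrast between simple weighting and a Chebyshev scalarization mentioned in the abstract above can be illustrated with a short sketch. It shows the standard weighted Chebyshev form, max_i w_i |f_i - z_i|, not the paper's modified variant; the function names and the example numbers are hypothetical.

```python
def weighted_sum(objectives, weights):
    """Simple weighting of the criteria: sum_i w_i * f_i."""
    return sum(w * f for f, w in zip(objectives, weights))


def weighted_chebyshev(objectives, weights, reference):
    """Standard weighted Chebyshev scalarization: max_i w_i * |f_i - z_i|.

    `reference` plays the role of an ideal point z, for example the best
    value attainable for each objective on its own.
    """
    return max(w * abs(f - z) for f, w, z in zip(objectives, weights, reference))


# Illustrative values: (task loss, sparsity penalty) for one candidate network.
objectives = (0.4, 0.1)
weights = (0.5, 0.5)
ideal = (0.0, 0.0)

ws_value = weighted_sum(objectives, weights)
cheb_value = weighted_chebyshev(objectives, weights, ideal)
```

    The Chebyshev form penalizes only the worst weighted deviation from the reference point, which is why it can reach Pareto-optimal points on non-convex fronts that a weighted sum cannot.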
    Continual Learning via Manifold Expansion Replay. (arXiv:2310.08038v1 [cs.LG])
    In continual learning, the learner learns multiple tasks in sequence, with data being acquired only once for each task. Catastrophic forgetting is a major challenge to continual learning. To reduce forgetting, some existing rehearsal-based methods use episodic memory to replay samples of previous tasks. However, in the process of knowledge integration when learning a new task, this strategy also suffers from catastrophic forgetting due to an imbalance between old and new knowledge. To address this problem, we propose a novel replay strategy called Manifold Expansion Replay (MaER). We argue that expanding the implicit manifold of the knowledge representation in the episodic memory helps to improve the robustness and expressiveness of the model. To this end, we propose a greedy strategy to keep increasing the diameter of the implicit manifold represented by the knowledge in the buffer during memory management. In addition, we introduce Wasserstein distance instead of cross entropy as distillation loss to preserve previous knowledge. With extensive experimental validation on MNIST, CIFAR10, CIFAR100, and TinyImageNet, we show that the proposed method significantly improves the accuracy in the continual learning setup, outperforming the state of the art.
    MINT: Evaluating LLMs in Multi-turn Interaction with Tools and Language Feedback. (arXiv:2309.10691v2 [cs.CL] UPDATED)
    To solve complex tasks, large language models (LLMs) often require multiple rounds of interactions with the user, sometimes assisted by external tools. However, current evaluation protocols often emphasize benchmark performance with single-turn exchanges, neglecting the nuanced interactions among the user, LLMs, and external tools, while also underestimating the importance of natural language feedback from users. These oversights contribute to discrepancies between research benchmark evaluations and real-world use cases. We introduce MINT, a benchmark that evaluates LLMs' ability to solve tasks with multi-turn interactions by (1) using tools and (2) leveraging natural language feedback. To ensure reproducibility, we provide an evaluation framework where LLMs can access tools by executing Python code and receive users' natural language feedback simulated by GPT-4. We repurpose a diverse set of established evaluation datasets focusing on reasoning, coding, and decision-making and carefully curate them into a compact subset for efficient evaluation. Our analysis of 20 open- and closed-source LLMs offers intriguing findings. (a) LLMs generally benefit from tools and language feedback, with performance gains (absolute, same below) of 1-8% for each turn of tool use and 2-17% with natural language feedback. (b) Better single-turn performance does not guarantee better multi-turn performance. (c) Surprisingly, on the LLMs evaluated, supervised instruction-finetuning (SIFT) and reinforcement learning from human feedback (RLHF) generally hurt multi-turn capabilities. We expect MINT can help measure progress and incentivize research in improving LLMs' capabilities in multi-turn interactions, especially for open-source communities where multi-turn human evaluation can be less accessible compared to commercial LLMs with a larger user base.
    Explorable Mesh Deformation Subspaces from Unstructured Generative Models. (arXiv:2310.07814v1 [cs.GR])
    Exploring variations of 3D shapes is a time-consuming process in traditional 3D modeling tools. Deep generative models of 3D shapes often feature continuous latent spaces that can, in principle, be used to explore potential variations starting from a set of input shapes. In practice, doing so can be problematic: latent spaces are high dimensional and hard to visualize, contain shapes that are not relevant to the input shapes, and linear paths through them often lead to sub-optimal shape transitions. Furthermore, one would ideally be able to explore variations in the original high-quality meshes used to train the generative model, not its lower-quality output geometry. In this paper, we present a method to explore variations among a given set of landmark shapes by constructing a mapping from an easily-navigable 2D exploration space to a subspace of a pre-trained generative model. We first describe how to find a mapping that spans the set of input landmark shapes and exhibits smooth variations between them. We then show how to turn the variations in this subspace into deformation fields, to transfer those variations to high-quality meshes for the landmark shapes. Our results show that our method can produce visually-pleasing and easily-navigable 2D exploration spaces for several different shape categories, especially as compared to prior work on learning deformation spaces for 3D shapes.
    Accountability in Offline Reinforcement Learning: Explaining Decisions with a Corpus of Examples. (arXiv:2310.07747v1 [cs.LG])
    Learning transparent, interpretable controllers with offline data in decision-making systems is an essential area of research due to its potential to reduce the risk of applications in real-world systems. However, in responsibility-sensitive settings such as healthcare, decision accountability is of paramount importance, yet has not been adequately addressed by the literature. This paper introduces the Accountable Offline Controller (AOC) that employs the offline dataset as the Decision Corpus and performs accountable control based on a tailored selection of examples, referred to as the Corpus Subset. AOC operates effectively in low-data scenarios, can be extended to the strictly offline imitation setting, and displays qualities of both conservation and adaptability. We assess AOC's performance in both simulated and real-world healthcare scenarios, emphasizing its capability to manage offline control tasks with high levels of performance while maintaining accountability. Keywords: Interpretable Reinforcement Learning, Explainable Reinforcement Learning, Reinforcement Learning Transparency, Offline Reinforcement Learning, Batched Control.
    A Complete Recipe for Diffusion Generative Models. (arXiv:2303.01748v2 [cs.LG] UPDATED)
    Score-based Generative Models (SGMs) have demonstrated exceptional synthesis outcomes across various tasks. However, the current design landscape of the forward diffusion process remains largely untapped and often relies on physical heuristics or simplifying assumptions. Utilizing insights from the development of scalable Bayesian posterior samplers, we present a complete recipe for formulating forward processes in SGMs, ensuring convergence to the desired target distribution. Our approach reveals that several existing SGMs can be seen as specific manifestations of our framework. Building upon this method, we introduce Phase Space Langevin Diffusion (PSLD), which relies on score-based modeling within an augmented space enriched by auxiliary variables akin to physical phase space. Empirical results exhibit the superior sample quality and improved speed-quality trade-off of PSLD compared to various competing approaches on established image synthesis benchmarks. Remarkably, PSLD achieves sample quality akin to state-of-the-art SGMs (FID: 2.10 for unconditional CIFAR-10 generation). Lastly, we demonstrate the applicability of PSLD in conditional synthesis using pre-trained score networks, offering an appealing alternative as an SGM backbone for future advancements. Code and model checkpoints can be accessed at https://github.com/mandt-lab/PSLD.
    Interpreting Reward Models in RLHF-Tuned Language Models Using Sparse Autoencoders. (arXiv:2310.08164v1 [cs.LG])
    Large language models (LLMs) aligned to human preferences via reinforcement learning from human feedback (RLHF) underpin many commercial applications. However, how RLHF impacts LLM internals remains opaque. We propose a novel method to interpret learned reward functions in RLHF-tuned LLMs using sparse autoencoders. Our approach trains autoencoder sets on activations from a base LLM and its RLHF-tuned version. By comparing autoencoder hidden spaces, we identify unique features that reflect the accuracy of the learned reward model. To quantify this, we construct a scenario where the tuned LLM learns token-reward mappings to maximize reward. This is the first application of sparse autoencoders for interpreting learned rewards and broadly inspecting reward learning in LLMs. Our method provides an abstract approximation of reward integrity. This presents a promising technique for ensuring alignment between specified objectives and model behaviors.
    Score Regularized Policy Optimization through Diffusion Behavior. (arXiv:2310.07297v2 [cs.LG] UPDATED)
    Recent developments in offline reinforcement learning have uncovered the immense potential of diffusion modeling, which excels at representing heterogeneous behavior policies. However, sampling from diffusion policies is considerably slow because it necessitates tens to hundreds of iterative inference steps for one action. To address this issue, we propose to extract an efficient deterministic inference policy from critic models and pretrained diffusion behavior models, leveraging the latter to directly regularize the policy gradient with the behavior distribution's score function during optimization. Our method enjoys powerful generative capabilities of diffusion modeling while completely circumventing the computationally intensive and time-consuming diffusion sampling scheme, both during training and evaluation. Extensive results on D4RL tasks show that our method boosts action sampling speed by more than 25 times compared with various leading diffusion-based methods in locomotion tasks, while still maintaining state-of-the-art performance.
    Only Pay for What Is Uncertain: Variance-Adaptive Thompson Sampling. (arXiv:2303.09033v2 [cs.LG] UPDATED)
    Most bandit algorithms assume that the reward variances or their upper bounds are known, and that they are the same for all arms. This naturally leads to suboptimal performance and higher regret due to variance overestimation. On the other hand, underestimated reward variances may lead to linear regret due to committing early to a suboptimal arm. This motivated prior works on variance-adaptive frequentist algorithms, which have strong instance-dependent regret bounds but cannot incorporate prior knowledge on reward variances. We lay foundations for the Bayesian setting, which incorporates prior knowledge. This results in lower regret in practice, due to using the prior in the algorithm design, and also improved regret guarantees. Specifically, we study Gaussian bandits with unknown heterogeneous reward variances, and develop a Thompson sampling algorithm with prior-dependent Bayes regret bounds. We achieve lower regret with lower reward variances and more informative priors on them, which is precisely why we pay only for what is uncertain. This is the first result of its kind. Finally, we corroborate our theory with extensive experiments, which show the superiority of our variance-adaptive Bayesian algorithm over prior frequentist approaches. We also show that our approach is robust to model misspecification and can be applied with estimated priors.
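The idea of Thompson sampling with unknown, per-arm reward variances can be illustrated with a small sketch. This is not the authors' algorithm: it assumes a Normal-Gamma conjugate prior on each arm's mean and precision, and the names `NormalGammaArm` and `run_bandit`, as well as the default prior values, are hypothetical choices made for illustration.

```python
import random


class NormalGammaArm:
    """Posterior over one arm's unknown mean and precision (1 / variance).

    Assumption: a Normal-Gamma prior with illustrative default parameters.
    """

    def __init__(self, mu=0.0, kappa=1.0, alpha=1.0, beta=1.0):
        self.mu, self.kappa, self.alpha, self.beta = mu, kappa, alpha, beta

    def sample_mean(self, rng):
        # Draw a precision from the Gamma posterior (gammavariate takes a
        # scale, so the rate beta is inverted), then a mean given it.
        precision = rng.gammavariate(self.alpha, 1.0 / self.beta)
        return rng.gauss(self.mu, (1.0 / (self.kappa * precision)) ** 0.5)

    def update(self, reward):
        # Conjugate Normal-Gamma update for a single observation.
        self.beta += self.kappa * (reward - self.mu) ** 2 / (2.0 * (self.kappa + 1.0))
        self.mu = (self.kappa * self.mu + reward) / (self.kappa + 1.0)
        self.kappa += 1.0
        self.alpha += 0.5


def run_bandit(true_means, true_sds, rounds, seed=0):
    """Simulate Thompson sampling and return the pull count per arm."""
    rng = random.Random(seed)
    arms = [NormalGammaArm() for _ in true_means]
    pulls = [0] * len(true_means)
    for _ in range(rounds):
        # Sample a plausible mean for every arm and play the largest one.
        samples = [arm.sample_mean(rng) for arm in arms]
        chosen = max(range(len(arms)), key=lambda i: samples[i])
        reward = rng.gauss(true_means[chosen], true_sds[chosen])
        arms[chosen].update(reward)
        pulls[chosen] += 1
    return pulls
```

Calling `run_bandit([0.0, 5.0], [0.1, 1.0], 500)` simulates two arms with different noise levels; an arm whose sampled variance is small has a tightly concentrated posterior, so the sampler spends little further exploration on it.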
    Extreme Image Transformations Facilitate Robust Latent Object Representations. (arXiv:2310.07725v1 [cs.LG])
    Adversarial attacks can affect the object recognition capabilities of machines in the wild. These can often result from spurious correlations between input and class labels, and are prone to memorization in large networks. While networks are expected to do automated feature selection, this selection is not effective at the scale of the object. Humans, however, are able to select the minimum set of features required to form a robust representation of an object. In this work, we show that finetuning any pretrained off-the-shelf network with Extreme Image Transformations (EIT) not only helps in learning a robust latent representation, but also improves the performance of these networks against common adversarial attacks of various intensities. Our EIT trained networks show strong activations in the object regions even when tested with more intense noise, showing promising generalizations across different kinds of adversarial attacks.
    Physics Constrained Unsupervised Deep Learning for Rapid, High Resolution Scanning Coherent Diffraction Reconstruction. (arXiv:2306.11014v2 [physics.comp-ph] UPDATED)
    By circumventing the resolution limitations of optics, coherent diffractive imaging (CDI) and ptychography are making their way into scientific fields ranging from X-ray imaging to astronomy. Yet, the need for time consuming iterative phase recovery hampers real-time imaging. While supervised deep learning strategies have increased reconstruction speed, they sacrifice image quality. Furthermore, these methods' demand for extensive labeled training data is experimentally burdensome. Here, we propose an unsupervised physics-informed neural network reconstruction method, PtychoPINN, that retains the factor of 100-to-1000 speedup of deep learning-based reconstruction while improving reconstruction quality by combining the diffraction forward map with real-space constraints from overlapping measurements. In particular, PtychoPINN significantly advances generalizability, accuracy (with a typical 10 dB PSNR increase), and linear resolution (2- to 6-fold gain). This blend of performance and speed offers exciting prospects for high-resolution real-time imaging in high-throughput environments such as X-ray free electron lasers (XFELs) and diffraction-limited light sources.
    In-Context Unlearning: Language Models as Few Shot Unlearners. (arXiv:2310.07579v2 [cs.LG] UPDATED)
    Machine unlearning, the study of efficiently removing the impact of specific training points on the trained model, has garnered increased attention of late, driven by the need to comply with privacy regulations like the Right to be Forgotten. Although unlearning is particularly relevant for LLMs in light of the copyright issues they raise, achieving precise unlearning is computationally infeasible for very large models. To this end, recent work has proposed several algorithms which approximate the removal of training data without retraining the model. These algorithms crucially rely on access to the model parameters in order to update them, an assumption that may not hold in practice due to computational constraints or when the LLM is accessed via API. In this work, we propose a new class of unlearning methods for LLMs we call "In-Context Unlearning", providing inputs in context and without having to update model parameters. To unlearn a particular training instance, we provide the instance alongside a flipped label and additional correctly labelled instances which are prepended as inputs to the LLM at inference time. Our experimental results demonstrate that these contexts effectively remove specific information from the training set while maintaining performance levels that are competitive with (or in some cases exceed) state-of-the-art unlearning methods that require access to the LLM parameters.
    Lifelong Audio-video Masked Autoencoder with Forget-robust Localized Alignments. (arXiv:2310.08204v1 [cs.CV])
    We present a lifelong audio-video masked autoencoder that continually learns the multimodal representations from a video stream containing audio-video pairs, while its distribution continually shifts over time. Specifically, we propose two novel ideas to tackle the problem: (1) Localized Alignment: We introduce a small trainable multimodal encoder that predicts the audio and video tokens that are well-aligned with each other. This allows the model to learn only the highly correlated audiovisual patches with accurate multimodal relationships. (2) Forget-robust multimodal patch selection: We compare the relative importance of each audio-video patch between the current and past data pair to mitigate unintended drift of the previously learned audio-video representations. Our proposed method, FLAVA (Forget-robust Localized Audio-Video Alignment), therefore, captures the complex relationships between the audio and video modalities during training on a sequence of pre-training tasks while alleviating the forgetting of learned audiovisual correlations. Our experiments validate that FLAVA outperforms the state-of-the-art continual learning methods on several benchmark datasets under continual audio-video representation learning scenarios.
    Impact of Co-occurrence on Factual Knowledge of Large Language Models. (arXiv:2310.08256v1 [cs.CL])
    Large language models (LLMs) often make factually incorrect responses despite their success in various applications. In this paper, we hypothesize that relying heavily on simple co-occurrence statistics of the pre-training corpora is one of the main factors that cause factual errors. Our results reveal that LLMs are vulnerable to the co-occurrence bias, defined as preferring frequently co-occurred words over the correct answer. Consequently, LLMs struggle to recall facts whose subject and object rarely co-occur in the pre-training dataset although they are seen during finetuning. We show that co-occurrence bias remains despite scaling up model sizes or finetuning. Therefore, we suggest finetuning on a debiased dataset to mitigate the bias by filtering out biased samples whose subject-object co-occurrence count is high. Although debiased finetuning allows LLMs to memorize rare facts in the training set, it is not effective in recalling rare facts unseen during finetuning. Further research in mitigation will help build reliable language models by preventing potential errors. The code is available at https://github.com/CheongWoong/impact_of_cooccurrence.
    Observatory: Characterizing Embeddings of Relational Tables. (arXiv:2310.07736v1 [cs.DB])
    Language models and specialized table embedding models have recently demonstrated strong performance on many tasks over tabular data. Researchers and practitioners are keen to leverage these models in many new application contexts; but limited understanding of the strengths and weaknesses of these models, and the table representations they generate, makes the process of finding a suitable model for a given task reliant on trial and error. There is an urgent need to gain a comprehensive understanding of these models to minimize inefficiency and failures in downstream usage. To address this need, we propose Observatory, a formal framework to systematically analyze embedding representations of relational tables. Motivated both by invariants of the relational data model and by statistical considerations regarding data distributions, we define eight primitive properties, and corresponding measures to quantitatively characterize table embeddings for these properties. Based on these properties, we define an extensible framework to evaluate language and table embedding models. We collect and synthesize a suite of datasets and use Observatory to analyze seven such models. Our analysis provides insights into the strengths and weaknesses of learned representations over tables. We find, for example, that some models are sensitive to table structure such as column order, that functional dependencies are rarely reflected in embeddings, and that specialized table embedding models have relatively lower sample fidelity. Such insights help researchers and practitioners better anticipate model behaviors and select appropriate models for their downstream tasks, while guiding researchers in the development of new models.
    Understanding Sparse Feature Updates in Deep Networks using Iterative Linearisation. (arXiv:2211.12345v4 [cs.LG] UPDATED)
    Larger and deeper networks generalise well despite their increased capacity to overfit. Understanding why this happens is theoretically and practically important. One recent approach looks at the infinitely wide limits of such networks and their corresponding kernels. However, these theoretical tools cannot fully explain finite networks as the empirical kernel changes significantly during gradient-descent-based training in contrast to infinite networks. In this work, we derive an iterative linearised training method as a novel empirical tool to further investigate this distinction, allowing us to control for sparse (i.e. infrequent) feature updates and quantify the frequency of feature learning needed to achieve comparable performance. We justify iterative linearisation as an interpolation between a finite analog of the infinite width regime, which does not learn features, and standard gradient descent training, which does. Informally, we also show that it is analogous to a damped version of the Gauss-Newton algorithm -- a second-order method. We show that in a variety of cases, iterative linearised training surprisingly performs on par with standard training, noting in particular how much less frequent feature learning is required to achieve comparable performance. We also show that feature learning is essential for good performance. Since such feature learning inevitably causes changes in the NTK kernel, we provide direct negative evidence for the NTK theory, which states the NTK kernel remains constant during training.
    Dealing with zero-inflated data: achieving SOTA with a two-fold machine learning approach. (arXiv:2310.08088v1 [cs.LG])
    In many cases, a machine learning model must learn to correctly predict a few data points with particular values of interest in a broader range of data where many target values are zero. Zero-inflated data can be found in diverse scenarios, such as lumpy and intermittent demands, power consumption for home appliances being turned on and off, impurities measurement in distillation processes, and even airport shuttle demand prediction. The presence of zeroes affects the models' learning and may result in poor performance. Furthermore, zeroes also distort the metrics used to compute the model's prediction quality. This paper showcases two real-world use cases (home appliances classification and airport shuttle demand prediction) where a hierarchical model applied in the context of zero-inflated data leads to excellent results. In particular, for home appliances classification, the weighted average of Precision, Recall, F1, and AUC ROC was increased by 27%, 34%, 49%, and 27%, respectively. Furthermore, it is estimated that the proposed approach is also four times more energy efficient than the SOTA approach against which it was compared. Two-fold models performed best in all cases when predicting airport shuttle demand, and the difference against other models has been proven to be statistically significant.  ( 2 min )
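A two-fold (hierarchical) model for zero-inflated targets can be sketched as a first stage that decides whether the target is zero and a second stage that is fitted only on the non-zero rows. The abstract does not specify the models used, so everything below is an assumption for illustration: a single numeric feature, a midpoint-threshold classifier for stage one, a least-squares line for stage two, and the hypothetical name `fit_two_fold`.

```python
from statistics import mean


def fit_two_fold(xs, ys):
    """Fit a toy two-stage model on one feature and return a predict function.

    Assumes the data contains both zero and non-zero targets.
    """
    zero_xs = [x for x, y in zip(xs, ys) if y == 0]
    nonzero = [(x, y) for x, y in zip(xs, ys) if y != 0]
    nz_xs = [x for x, _ in nonzero]
    nz_ys = [y for _, y in nonzero]

    # Stage 1: classify zero vs. non-zero with a threshold halfway between
    # the mean feature value of each group.
    threshold = (mean(zero_xs) + mean(nz_xs)) / 2.0
    nonzero_is_above = mean(nz_xs) > mean(zero_xs)

    # Stage 2: ordinary least squares, fitted on non-zero rows only so the
    # zeros cannot drag the regression line towards zero.
    x_bar, y_bar = mean(nz_xs), mean(nz_ys)
    spread = sum((x - x_bar) ** 2 for x in nz_xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in nonzero) / spread if spread else 0.0
    intercept = y_bar - slope * x_bar

    def predict(x):
        is_nonzero = (x > threshold) == nonzero_is_above
        return slope * x + intercept if is_nonzero else 0.0

    return predict
```

For example, `predict = fit_two_fold([0, 1, 2, 3, 10, 11, 12, 13], [0, 0, 0, 0, 20, 22, 24, 26])` yields a model that outputs exactly zero for small feature values and follows the fitted line elsewhere.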
    Invisible Threats: Backdoor Attack in OCR Systems. (arXiv:2310.08259v1 [cs.CR])
    Optical Character Recognition (OCR) is a widely used tool to extract text from scanned documents. Today, the state-of-the-art is achieved by exploiting deep neural networks. However, the cost of this performance is paid at the price of system vulnerability. For instance, in backdoor attacks, attackers compromise the training phase by inserting a backdoor in the victim's model that will be activated at testing time by specific patterns while leaving the overall model performance intact. This work proposes a backdoor attack for OCR resulting in the injection of non-readable characters from malicious input images. This simple but effective attack exposes the state-of-the-art OCR weakness, making the extracted text correct to human eyes but simultaneously unusable for the NLP application that uses OCR as a preprocessing step. Experimental results show that the attacked models successfully output non-readable characters for around 90% of the poisoned instances without harming their performance for the remaining instances.
    Adaptive Optimizers with Sparse Group Lasso for Neural Networks in CTR Prediction. (arXiv:2107.14432v4 [cs.LG] UPDATED)
    We develop a novel framework that adds the regularizers of the sparse group lasso to a family of adaptive optimizers in deep learning, such as Momentum, Adagrad, Adam, AMSGrad, AdaHessian, and create a new class of optimizers, which are named Group Momentum, Group Adagrad, Group Adam, Group AMSGrad and Group AdaHessian, etc., accordingly. We establish theoretically proven convergence guarantees in the stochastic convex settings, based on primal-dual methods. We evaluate the regularized effect of our new optimizers on three large-scale real-world ad click datasets with state-of-the-art deep learning models. The experimental results reveal that compared with the original optimizers with the post-processing procedure which uses the magnitude pruning method, the performance of the models can be significantly improved on the same sparsity level. Furthermore, in comparison to the cases without magnitude pruning, our methods can achieve extremely high sparsity with significantly better or highly competitive performance. The code is available at https://github.com/intelligent-machine-learning/dlrover/blob/master/tfplus.  ( 3 min )
    GIO: Gradient Information Optimization for Training Dataset Selection. (arXiv:2306.11670v2 [cs.LG] UPDATED)
    It is often advantageous to train models on a subset of the available train examples, because the examples are of variable quality or because one would like to train with fewer examples, without sacrificing performance. We present Gradient Information Optimization (GIO), a scalable, task-agnostic approach to this data selection problem that requires only a small set of (unlabeled) examples representing a target distribution. GIO begins from a natural, information-theoretic objective that is intractable in practice. Our contribution is in showing that it can be made highly scalable through a simple relaxation of the objective and a highly efficient implementation. In experiments with machine translation, spelling correction, and image recognition, we show that GIO delivers outstanding results with very small train sets. These findings are robust to different representation models and hyperparameters for GIO itself. GIO is task- and domain-agnostic and can be applied out-of-the-box to new datasets and domains.
    A Theory of Non-Linear Feature Learning with One Gradient Step in Two-Layer Neural Networks. (arXiv:2310.07891v1 [stat.ML])
    Feature learning is thought to be one of the fundamental reasons for the success of deep neural networks. It is rigorously known that in two-layer fully-connected neural networks under certain conditions, one step of gradient descent on the first layer followed by ridge regression on the second layer can lead to feature learning; characterized by the appearance of a separated rank-one component -- spike -- in the spectrum of the feature matrix. However, with a constant gradient descent step size, this spike only carries information from the linear component of the target function and therefore learning non-linear components is impossible. We show that with a learning rate that grows with the sample size, such training in fact introduces multiple rank-one components, each corresponding to a specific polynomial feature. We further prove that the limiting large-dimensional and large sample training and test errors of the updated neural networks are fully characterized by these spikes. By precisely analyzing the improvement in the loss, we demonstrate that these non-linear features can enhance learning.
    QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Models. (arXiv:2310.08041v1 [cs.CL])
    Large Language Models (LLMs) excel in NLP, but their demands hinder their widespread deployment. While Quantization-Aware Training (QAT) offers a solution, its extensive training costs make Post-Training Quantization (PTQ) a more practical approach for LLMs. In existing studies, activation outliers in particular channels are identified as the bottleneck to PTQ accuracy. They propose to transform the magnitudes from activations to weights, which however offers limited alleviation or suffers from unstable gradients, resulting in a severe performance drop at low-bitwidth. In this paper, we propose QLLM, an accurate and efficient low-bitwidth PTQ method designed for LLMs. QLLM introduces an adaptive channel reassembly technique that reallocates the magnitude of outliers to other channels, thereby mitigating their impact on the quantization range. This is achieved by channel disassembly and channel assembly, which first breaks down the outlier channels into several sub-channels to ensure a more balanced distribution of activation magnitudes. Then similar channels are merged to maintain the original channel number for efficiency. Additionally, an adaptive strategy is designed to autonomously determine the optimal number of sub-channels for channel disassembly. To further compensate for the performance loss caused by quantization, we propose an efficient tuning method that only learns a small number of low-rank weights while freezing the pre-trained quantized model. After training, these low-rank parameters can be fused into the frozen weights without affecting inference. Extensive experiments on LLaMA-1 and LLaMA-2 show that QLLM can obtain accurate quantized models efficiently. For example, QLLM quantizes the 4-bit LLaMA-2-70B within 10 hours on a single A100-80G GPU, outperforming the previous state-of-the-art method by 7.89% on the average accuracy across five zero-shot tasks.  ( 3 min )
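The effect of channel disassembly can be illustrated on a single dot product. The sketch below assumes that an outlier channel is split into equal sub-channels and that the matching weight is replicated, which keeps the layer output unchanged while shrinking the largest activation magnitude; this is an illustrative reading of the abstract, not the QLLM implementation, and `disassemble_channel` and `dot` are hypothetical helper names.

```python
def dot(activations, weights):
    """Plain dot product, standing in for one output of a linear layer."""
    return sum(a * w for a, w in zip(activations, weights))


def disassemble_channel(activations, weights, channel, parts):
    """Split one activation channel into `parts` equal sub-channels.

    Each sub-channel carries activation / parts and reuses the original
    weight, so the sum over sub-channels reproduces the original term.
    """
    share = activations[channel] / parts
    new_activations = activations[:channel] + [share] * parts + activations[channel + 1:]
    new_weights = weights[:channel] + [weights[channel]] * parts + weights[channel + 1:]
    return new_activations, new_weights


# Toy example: channel 1 is the outlier.
x = [1.0, 8.0, 2.0]
w = [0.5, 0.25, -1.0]
x_split, w_split = disassemble_channel(x, w, channel=1, parts=4)
```

Here `dot(x, w)` and `dot(x_split, w_split)` agree, while the largest activation magnitude drops from 8.0 to 2.0, which narrows the range a quantizer has to cover. The channel count grows in this step, which is why the abstract pairs it with a channel assembly step that merges similar channels afterwards.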
    TriRE: A Multi-Mechanism Learning Paradigm for Continual Knowledge Retention and Promotion. (arXiv:2310.08217v1 [cs.AI])
    Continual learning (CL) has remained a persistent challenge for deep neural networks due to catastrophic forgetting (CF) of previously learned tasks. Several techniques such as weight regularization, experience rehearsal, and parameter isolation have been proposed to alleviate CF. Despite their relative success, these research directions have predominantly remained orthogonal and suffer from several shortcomings, while missing out on the advantages of competing strategies. On the contrary, the brain continually learns, accommodates, and transfers knowledge across tasks by simultaneously leveraging several neurophysiological processes, including neurogenesis, active forgetting, neuromodulation, metaplasticity, experience rehearsal, and context-dependent gating, rarely resulting in CF. Inspired by how the brain exploits multiple mechanisms concurrently, we propose TriRE, a novel CL paradigm that encompasses retaining the most prominent neurons for each task, revising and solidifying the extracted knowledge of current and past tasks, and actively promoting less active neurons for subsequent tasks through rewinding and relearning. Across CL settings, TriRE significantly reduces task interference and surpasses different CL approaches considered in isolation.  ( 2 min )
    Improving Fast Minimum-Norm Attacks with Hyperparameter Optimization. (arXiv:2310.08177v1 [cs.LG])
    Evaluating the adversarial robustness of machine learning models using gradient-based attacks is challenging. In this work, we show that hyperparameter optimization can improve fast minimum-norm attacks by automating the selection of the loss function, the optimizer and the step-size scheduler, along with the corresponding hyperparameters. Our extensive evaluation involving several robust models demonstrates the improved efficacy of fast minimum-norm attacks when hyped up with hyperparameter optimization. We release our open-source code at https://github.com/pralab/HO-FMN.  ( 2 min )
    Data-Centric Learning from Unlabeled Graphs with Diffusion Model. (arXiv:2303.10108v2 [cs.LG] UPDATED)
    Graph property prediction tasks are important and numerous. While each task offers only a small number of labeled examples, unlabeled graphs have been collected from various sources and at a large scale. A conventional approach is training a model with the unlabeled graphs on self-supervised tasks and then fine-tuning the model on the prediction tasks. However, the knowledge learned from the self-supervised tasks may not align, and sometimes conflicts, with what the prediction tasks need. In this paper, we propose to extract the knowledge underlying the large set of unlabeled graphs as a specific set of useful data points to augment each property prediction model. We use a diffusion model to fully utilize the unlabeled graphs and design two new objectives to guide the model's denoising process with each task's labeled data to generate task-specific graph examples and their labels. Experiments demonstrate that our data-centric approach performs significantly better than fifteen existing methods on fifteen tasks. Unlike in self-supervised learning, the performance improvement brought by unlabeled data is visible in the form of the generated labeled examples.  ( 2 min )
    L2P: Learning to Place for Estimating Heavy-Tailed Distributed Outcomes. (arXiv:1908.04628v3 [cs.LG] UPDATED)
    Many real-world prediction tasks have outcome variables that have characteristic heavy-tail distributions. Examples include copies of books sold, auction prices of art pieces, demand for commodities in warehouses, etc. By learning heavy-tailed distributions, "big and rare" instances (e.g., the best-sellers) will have accurate predictions. Most existing approaches are not dedicated to learning heavy-tailed distribution; thus, they heavily under-predict such instances. To tackle this problem, we introduce Learning to Place (L2P), which exploits the pairwise relationships between instances for learning. In its training phase, L2P learns a pairwise preference classifier: is instance A > instance B? In its placing phase, L2P obtains a prediction by placing the new instance among the known instances. Based on its placement, the new instance is then assigned a value for its outcome variable. Experiments on real data show that L2P outperforms competing approaches in terms of accuracy and ability to reproduce heavy-tailed outcome distribution. In addition, L2P provides an interpretable model by placing each predicted instance in relation to its comparable neighbors. Interpretable models are highly desirable when lives and treasure are at stake.
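The placing phase described above can be sketched as follows. The abstract does not say how the pairwise votes are turned into a value, so this sketch makes an assumption: it counts how many known instances the new instance is judged to exceed and interpolates between the two neighbouring known outcomes. The name `place_and_predict` and the `prefers` callable (standing in for the trained pairwise classifier) are hypothetical.

```python
def place_and_predict(new_x, known_xs, known_ys, prefers):
    """Place a new instance among known ones and read off an outcome value.

    prefers(a, b) stands in for the learned pairwise classifier and should
    return True when instance a is predicted to have a larger outcome than b.
    """
    # Known outcomes in increasing order define the ranking to place into.
    sorted_ys = sorted(known_ys)

    # Each known instance the new one is judged to exceed moves it up a rank.
    rank = sum(1 for x in known_xs if prefers(new_x, x))

    if rank == 0:
        return sorted_ys[0]           # judged below every known instance
    if rank == len(sorted_ys):
        return sorted_ys[-1]          # judged above every known instance
    # Otherwise take the midpoint of the two neighbours at that rank.
    return 0.5 * (sorted_ys[rank - 1] + sorted_ys[rank])


# Toy usage with a trivial comparator on a single numeric feature.
known_xs = [1, 2, 3, 4]
known_ys = [10, 20, 30, 1000]   # one heavy-tail outcome
estimate = place_and_predict(2.5, known_xs, known_ys, lambda a, b: a > b)
```

Because the prediction is read off from neighbouring known instances, a new instance judged above all others receives the largest known outcome instead of being pulled towards the bulk of the distribution, and the neighbours themselves serve as an explanation for the prediction.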
    ETDock: A Novel Equivariant Transformer for Protein-Ligand Docking. (arXiv:2310.08061v1 [q-bio.BM])
    Predicting the docking between proteins and ligands is a crucial and challenging task for drug discovery. However, traditional docking methods mainly rely on scoring functions, and deep learning-based docking approaches usually neglect the 3D spatial information of proteins and ligands, as well as the graph-level features of ligands, which limits their performance. To address these limitations, we propose an equivariant transformer neural network for protein-ligand docking pose prediction. Our approach involves the fusion of ligand graph-level features by feature processing, followed by the learning of ligand and protein representations using our proposed TAMformer module. Additionally, we employ an iterative optimization approach based on the predicted distance matrix to generate refined ligand poses. The experimental results on real datasets show that our model can achieve state-of-the-art performance.  ( 2 min )
    Lag-Llama: Towards Foundation Models for Time Series Forecasting. (arXiv:2310.08278v1 [cs.LG])
    Aiming to build foundation models for time-series forecasting and study their scaling behavior, we present here our work-in-progress on Lag-Llama, a general-purpose univariate probabilistic time-series forecasting model trained on a large collection of time-series data. The model shows good zero-shot prediction capabilities on unseen "out-of-distribution" time-series datasets, outperforming supervised baselines. We use smoothly broken power-laws to fit and predict model scaling behavior. The open source code is made available at https://github.com/kashif/pytorch-transformer-ts.
    Rethinking Large-scale Pre-ranking System: Entire-chain Cross-domain Models. (arXiv:2310.08039v1 [cs.IR])
    Industrial systems, such as recommender systems and online advertising, have been widely equipped with multi-stage architectures, which are divided into several cascaded modules, including matching, pre-ranking, ranking and re-ranking. Pre-ranking is a critical bridge between matching and ranking, yet existing pre-ranking approaches mainly suffer from the sample selection bias (SSB) problem because they ignore the entire-chain data dependence, resulting in sub-optimal performance. In this paper, we rethink the pre-ranking system from the perspective of the entire sample space, and propose Entire-chain Cross-domain Models (ECM), which leverage samples from all cascaded stages to effectively alleviate the SSB problem. Besides, we design a fine-grained neural structure named ECMM to further improve the pre-ranking accuracy. Specifically, we propose a cross-domain multi-tower neural network to comprehensively predict each stage's result, and introduce a sub-network routing strategy with $L_0$ regularization to reduce computational costs. Evaluations on real-world large-scale traffic logs demonstrate that our pre-ranking models outperform SOTA methods while time consumption is maintained within an acceptable level, which achieves a better trade-off between efficiency and effectiveness.
    LightZero: A Unified Benchmark for Monte Carlo Tree Search in General Sequential Decision Scenarios. (arXiv:2310.08348v1 [cs.LG])
    Building agents based on tree-search planning capabilities with learned models has achieved remarkable success in classic decision-making problems, such as Go and Atari. However, it has been deemed challenging or even infeasible to extend Monte Carlo Tree Search (MCTS) based algorithms to diverse real-world applications, especially when these environments involve complex action spaces and significant simulation costs, or inherent stochasticity. In this work, we introduce LightZero, the first unified benchmark for deploying MCTS/MuZero in general sequential decision scenarios. Specifically, we summarize the most critical challenges in designing a general MCTS-style decision-making solver, then decompose the tightly-coupled algorithm and system design of tree-search RL methods into distinct sub-modules. By incorporating more appropriate exploration and optimization strategies, we can significantly enhance these sub-modules and construct powerful LightZero agents to tackle tasks across a wide range of domains, such as board games, Atari, MuJoCo, MiniGrid and GoBigger. Detailed benchmark results reveal the significant potential of such methods in building scalable and efficient decision intelligence. The code is available as part of OpenDILab at https://github.com/opendilab/LightZero.  ( 2 min )
    Why Train More? Effective and Efficient Membership Inference via Memorization. (arXiv:2310.08015v1 [cs.LG])
    Membership Inference Attacks (MIAs) aim to identify specific data samples within the private training dataset of machine learning models, leading to serious privacy violations and other sophisticated threats. Many practical black-box MIAs require query access to the data distribution (the same distribution where the private data is drawn) to train shadow models. By doing so, the adversary obtains models trained "with" or "without" samples drawn from the distribution, and analyzes the characteristics of the samples under consideration. The adversary is often required to train more than hundreds of shadow models to extract the signals needed for MIAs; this becomes the computational overhead of MIAs. In this paper, we propose that by strategically choosing the samples, MI adversaries can maximize their attack success while minimizing the number of shadow models. First, our motivational experiments suggest memorization as the key property explaining disparate sample vulnerability to MIAs. We formalize this through a theoretical bound that connects MI advantage with memorization. Second, we show sample complexity bounds that connect the number of shadow models needed for MIAs with memorization. Lastly, we confirm our theoretical arguments with comprehensive experiments; by utilizing samples with high memorization scores, the adversary can (a) significantly improve its efficacy regardless of the MIA used, and (b) reduce the number of shadow models by nearly two orders of magnitude compared to state-of-the-art approaches.  ( 2 min )
    NeRF2: Neural Radio-Frequency Radiance Fields. (arXiv:2305.06118v2 [cs.NI] UPDATED)
    Although Maxwell discovered the physical laws of electromagnetic waves 160 years ago, how to precisely model the propagation of an RF signal in an electrically large and complex environment remains a long-standing problem. The difficulty is in the complex interactions between the RF signal and the obstacles (e.g., reflection, diffraction, etc.). Inspired by the great success of using a neural network to describe the optical field in computer vision, we propose a neural radio-frequency radiance field, NeRF$^\textbf{2}$, which represents a continuous volumetric scene function that makes sense of an RF signal's propagation. Particularly, after training with a few signal measurements, NeRF$^\textbf{2}$ can tell how/what signal is received at any position when it knows the position of a transmitter. As a physical-layer neural network, NeRF$^\textbf{2}$ can take advantage of the learned statistical model plus the physical model of ray tracing to generate a synthetic dataset that meets the training demands of application-layer artificial neural networks (ANNs). Thus, we can boost the performance of ANNs by the proposed turbo-learning, which mixes the true and synthetic datasets to intensify the training. Our experimental results show that turbo-learning can enhance performance by approximately 50%. We also demonstrate the power of NeRF$^\textbf{2}$ in the field of indoor localization and 5G MIMO.
    Quasi-Arithmetic Mixtures, Divergence Minimization, and Bregman Information. (arXiv:2209.07481v2 [cs.LG] UPDATED)
    Markov Chain Monte Carlo methods for sampling from complex distributions and estimating normalization constants often simulate samples from a sequence of intermediate distributions along an annealing path, which bridges between a tractable initial distribution and a target density of interest. Prior work has constructed annealing paths using quasi-arithmetic means, and interpreted the resulting intermediate densities as minimizing an expected divergence to the endpoints. We provide a comprehensive analysis of this 'centroid' property using Bregman divergences under a monotonic embedding of the density function, thereby associating common divergences such as Amari's and Rényi's ${\alpha}$-divergences, ${(\alpha,\beta)}$-divergences, and the Jensen-Shannon divergence with intermediate densities along an annealing path. Our analysis highlights the interplay between parametric families, quasi-arithmetic means, and divergence functions using the rho-tau Bregman divergence framework of Zhang (2004, 2013).
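    As a rough illustration of the quasi-arithmetic paths mentioned above, the standard textbook form is written out below. The symbols $h$, $\beta$, $\tilde{\pi}_0$ and $\tilde{\pi}_1$ are notation assumed here for the sketch and are not taken from the abstract; the paper's own parameterization may differ.

```latex
% Quasi-arithmetic mean of two unnormalized densities under a monotonic function h,
% with mixing weight beta in [0, 1] (notation assumed for illustration).
\begin{align}
  \tilde{\pi}_\beta(z) &= h^{-1}\!\Big( (1-\beta)\, h\big(\tilde{\pi}_0(z)\big) + \beta\, h\big(\tilde{\pi}_1(z)\big) \Big) \\
  % Special case h(u) = u: the arithmetic (mixture) path.
  \tilde{\pi}_\beta(z) &= (1-\beta)\, \tilde{\pi}_0(z) + \beta\, \tilde{\pi}_1(z) \\
  % Special case h(u) = \log u: the geometric path commonly used in annealing.
  \tilde{\pi}_\beta(z) &= \tilde{\pi}_0(z)^{1-\beta}\, \tilde{\pi}_1(z)^{\beta}
\end{align}
```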
    MemSAC: Memory Augmented Sample Consistency for Large Scale Unsupervised Domain Adaptation. (arXiv:2207.12389v2 [cs.CV] UPDATED)
    Practical real-world datasets with plentiful categories introduce new challenges for unsupervised domain adaptation, such as small inter-class discriminability, which existing approaches relying on domain invariance alone cannot handle sufficiently well. In this work we propose MemSAC, which exploits sample level similarity across source and target domains to achieve discriminative transfer, along with architectures that scale to a large number of categories. For this purpose, we first introduce a memory augmented approach to efficiently extract pairwise similarity relations between labeled source and unlabeled target domain instances, suited to handle an arbitrary number of classes. Next, we propose and theoretically justify a novel variant of the contrastive loss to promote local consistency among within-class cross domain samples while enforcing separation between classes, thus preserving discriminative transfer from source to target. We validate the advantages of MemSAC with significant improvements over previous state-of-the-art on multiple challenging transfer tasks designed for large-scale adaptation, such as DomainNet with 345 classes and fine-grained adaptation on Caltech-UCSD birds dataset with 200 classes. We also provide in-depth analysis and insights into the effectiveness of MemSAC.
    Learning Joint Latent Space EBM Prior Model for Multi-layer Generator. (arXiv:2306.06323v2 [cs.CV] UPDATED)
    This paper studies the fundamental problem of learning multi-layer generator models. The multi-layer generator model builds multiple layers of latent variables as a prior model on top of the generator, which benefits learning complex data distribution and hierarchical representations. However, such a prior model usually focuses on modeling inter-layer relations between latent variables by assuming non-informative (conditional) Gaussian distributions, which can be limited in model expressivity. To tackle this issue and learn more expressive prior models, we propose an energy-based model (EBM) on the joint latent space over all layers of latent variables with the multi-layer generator as its backbone. Such joint latent space EBM prior model captures the intra-layer contextual relations at each layer through layer-wise energy terms, and latent variables across different layers are jointly corrected. We develop a joint training scheme via maximum likelihood estimation (MLE), which involves Markov Chain Monte Carlo (MCMC) sampling for both prior and posterior distributions of the latent variables from different layers. To ensure efficient inference and learning, we further propose a variational training scheme where an inference model is used to amortize the costly posterior MCMC sampling. Our experiments demonstrate that the learned model can be expressive in generating high-quality images and capturing hierarchical features for better outlier detection.  ( 2 min )
    Precise localization within the GI tract by combining classification of CNNs and time-series analysis of HMMs. (arXiv:2310.07895v1 [cs.LG])
    This paper presents a method to efficiently classify the gastroenterologic section of images derived from Video Capsule Endoscopy (VCE) studies by exploring the combination of a Convolutional Neural Network (CNN) for classification with the time-series analysis properties of a Hidden Markov Model (HMM). It is demonstrated that successive time-series analysis identifies and corrects errors in the CNN output. Our approach achieves an accuracy of $98.04\%$ on the Rhode Island (RI) Gastroenterology dataset. This allows for precise localization within the gastrointestinal (GI) tract while requiring only approximately 1M parameters, and thus provides a method suitable for low-power devices.  ( 2 min )
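    The idea of letting an HMM correct per-frame CNN errors can be sketched with generic Viterbi decoding. This is a minimal sketch and not the authors' pipeline: treating the CNN softmax outputs as emission likelihoods, the "sticky" transition matrix, and the function name `viterbi` are all assumptions made for illustration.

```python
import numpy as np

def viterbi(log_emit, log_trans, log_prior):
    """Most likely state path given per-frame log-likelihoods.

    log_emit:  (T, S) log of per-frame class scores (here: assumed CNN softmax outputs)
    log_trans: (S, S) log transition matrix, rows = from-state, columns = to-state
    log_prior: (S,)   log initial state distribution
    """
    T, S = log_emit.shape
    score = log_prior + log_emit[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + log_trans          # cand[i, j]: best path ending in i, then i -> j
        back[t] = np.argmax(cand, axis=0)          # best predecessor for each state j
        score = cand[back[t], np.arange(S)] + log_emit[t]
    path = np.empty(T, dtype=int)
    path[-1] = int(np.argmax(score))
    for t in range(T - 1, 0, -1):                  # follow back-pointers
        path[t - 1] = back[t, path[t]]
    return path
```

    With a transition matrix that strongly favors staying in the same section, an isolated misclassified frame is typically overridden by its neighbors, which is the kind of correction the abstract describes.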
    Participatory Personalization in Classification. (arXiv:2302.03874v2 [cs.LG] UPDATED)
    Machine learning models are often personalized with information that is protected, sensitive, self-reported, or costly to acquire. These models use information about people but neither facilitate nor inform their consent. Individuals cannot opt out of reporting personal information to a model, nor tell if they benefit from personalization in the first place. We introduce a family of classification models, called participatory systems, that let individuals opt into personalization at prediction time. We present a model-agnostic algorithm to learn participatory systems for personalization with categorical group attributes. We conduct a comprehensive empirical study of participatory systems in clinical prediction tasks, benchmarking them with common approaches for personalization and imputation. Our results demonstrate that participatory systems can facilitate and inform consent while improving performance and data use across all groups who report personal data.
    Efficient Hyperdimensional Computing. (arXiv:2301.10902v2 [cs.LG] UPDATED)
    Hyperdimensional computing (HDC) is a method to perform classification that uses binary vectors with high dimensions and the majority rule. This approach has the potential to be energy-efficient and is hence deemed suitable for resource-limited platforms due to its simplicity and massive parallelism. However, in order to achieve high accuracy, HDC sometimes uses hypervectors with tens of thousands of dimensions. This potentially negates its efficiency advantage. In this paper, we examine the necessity of such high dimensions and conduct a detailed theoretical analysis of the relationship between hypervector dimensions and accuracy. Our results demonstrate that as the dimension of the hypervectors increases, the worst-case/average-case HDC prediction accuracy with the majority rule decreases. Building on this insight, we develop HDC models that use binary hypervectors with dimensions orders of magnitude lower than those of state-of-the-art HDC models while maintaining equivalent or even improved accuracy and efficiency. For instance, on the MNIST dataset, we achieve 91.12% HDC accuracy in image classification with a dimension of only 64. Our methods require only 0.35% of the operations of other HDC models with dimensions of 10,000. Furthermore, we evaluate our methods on ISOLET, UCI-HAR, and Fashion-MNIST datasets and investigate the limits of HDC computing.  ( 2 min )
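    For readers unfamiliar with the "binary vectors plus majority rule" setup, a generic sketch follows. It is not the paper's low-dimensional model: the function names `bundle` and `classify`, the Hamming-distance lookup, and the tie-breaking choice are assumptions made for illustration.

```python
import numpy as np

def bundle(hypervectors):
    """Combine binary hypervectors into one prototype by per-dimension majority vote.

    Ties (possible with an even number of inputs) resolve to 0 in this sketch.
    """
    hv = np.asarray(hypervectors)
    return (hv.sum(axis=0) * 2 > hv.shape[0]).astype(np.uint8)

def classify(query, prototypes):
    """Return the index of the class prototype with the smallest Hamming distance."""
    distances = (np.asarray(prototypes) != np.asarray(query)).sum(axis=1)
    return int(np.argmin(distances))
```

    In a typical HDC classifier, each class prototype would be the bundle of the encoded training samples of that class, and prediction is a nearest-prototype lookup as in `classify`.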
    Does Synthetic Data Make Large Language Models More Efficient? (arXiv:2310.07830v1 [cs.CL])
    Natural Language Processing (NLP) has undergone transformative changes with the advent of deep learning methodologies. One challenge persistently confronting researchers is the scarcity of high-quality, annotated datasets that drive these models. This paper explores the nuances of synthetic data generation in NLP, with a focal point on template-based question generation. By assessing its advantages, including data augmentation potential and the introduction of structured variety, we juxtapose these benefits against inherent limitations, such as the risk of overfitting and the constraints posed by pre-defined templates. Drawing from empirical evaluations, we demonstrate the impact of template-based synthetic data on the performance of modern transformer models. We conclude by emphasizing the delicate balance required between synthetic and real-world data, and the future trajectories of integrating synthetic data in model training pipelines. The findings aim to guide NLP practitioners in harnessing synthetic data's potential, ensuring optimal model performance in diverse applications.  ( 2 min )
    Spiral-Elliptical automated galaxy morphology classification from telescope images. (arXiv:2310.07740v1 [astro-ph.IM])
    The classification of galaxy morphologies is an important step in the investigation of theories of hierarchical structure formation. While human expert visual classification remains quite effective and accurate, it cannot keep up with the massive influx of data from emerging sky surveys. A variety of approaches have been proposed to classify large numbers of galaxies; these approaches include crowdsourced visual classification, and automated and computational methods, such as machine learning methods based on designed morphology statistics and deep learning. In this work, we develop two novel galaxy morphology statistics, descent average and descent variance, which can be efficiently extracted from telescope galaxy images. We further propose simplified versions of the existing image statistics concentration, asymmetry, and clumpiness, which have been widely used in the literature of galaxy morphologies. We utilize the galaxy image data from the Sloan Digital Sky Survey to demonstrate the effective performance of our proposed image statistics at accurately detecting spiral and elliptical galaxies when used as features of a random forest classifier.  ( 2 min )
    Joint Metrics Matter: A Better Standard for Trajectory Forecasting. (arXiv:2305.06292v2 [cs.RO] UPDATED)
    Multi-modal trajectory forecasting methods commonly evaluate using single-agent metrics (marginal metrics), such as minimum Average Displacement Error (ADE) and Final Displacement Error (FDE), which fail to capture joint performance of multiple interacting agents. Only focusing on marginal metrics can lead to unnatural predictions, such as colliding trajectories or diverging trajectories for people who are clearly walking together as a group. Consequently, methods optimized for marginal metrics lead to overly-optimistic estimations of performance, which is detrimental to progress in trajectory forecasting research. In response to the limitations of marginal metrics, we present the first comprehensive evaluation of state-of-the-art (SOTA) trajectory forecasting methods with respect to multi-agent metrics (joint metrics): JADE, JFDE, and collision rate. We demonstrate the importance of joint metrics as opposed to marginal metrics with quantitative evidence and qualitative examples drawn from the ETH / UCY and Stanford Drone datasets. We introduce a new loss function incorporating joint metrics that, when applied to a SOTA trajectory forecasting method, achieves a 7% improvement in JADE / JFDE on the ETH / UCY datasets with respect to the previous SOTA. Our results also indicate that optimizing for joint metrics naturally leads to an improvement in interaction modeling, as evidenced by a 16% decrease in mean collision rate on the ETH / UCY datasets with respect to the previous SOTA. Code is available at https://github.com/ericaweng/joint-metrics-matter.  ( 3 min )
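    The difference between marginal and joint metrics can be made concrete with a small sketch. The abstract does not spell out the formulas, so the definitions below are an assumed reading: marginal minADE picks the best of K samples separately for each agent, while joint minJADE picks one sample for the whole scene. The function names and array layout are hypothetical.

```python
import numpy as np

def _per_sample_agent_ade(preds, gt):
    """Average displacement per (sample, agent).

    preds: (K, N, T, 2) K sampled futures for N agents over T steps
    gt:    (N, T, 2)    ground-truth futures
    """
    return np.linalg.norm(preds - gt[None], axis=-1).mean(axis=-1)  # shape (K, N)

def min_ade(preds, gt):
    """Marginal metric: each agent independently picks its best sample."""
    return float(_per_sample_agent_ade(preds, gt).min(axis=0).mean())

def min_jade(preds, gt):
    """Joint metric: a single sample must be chosen for all agents together."""
    return float(_per_sample_agent_ade(preds, gt).mean(axis=1).min())
```

    Under these definitions minJADE can never be smaller than minADE, and the gap grows when no single sample is good for every agent at once, which is the failure mode the abstract attributes to marginal evaluation.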
    DeePref: Deep Reinforcement Learning For Video Prefetching In Content Delivery Networks. (arXiv:2310.07881v1 [cs.NI])
    Content Delivery Networks carry the majority of Internet traffic, and the increasing demand for video content as a major share of IP traffic across the Internet highlights the importance of caching and prefetching optimization algorithms. Prefetching aims to make data available in the cache before the requester places its request to reduce access time and improve the Quality of Experience on the user side. Prefetching is well investigated in operating systems, compiler instructions, in-memory cache, local storage systems, high-speed networks, and cloud systems. Traditional prefetching techniques are well adapted to a particular access pattern, but fail to adapt to sudden variations or randomization in workloads. This paper explores the use of reinforcement learning to tackle the changes in user access patterns and automatically adapt over time. To this end, we propose DeePref, a Deep Reinforcement Learning agent for online video content prefetching in Content Delivery Networks. DeePref is a prefetcher implemented on edge networks and is agnostic to hardware design, operating systems, and applications. Our results show that DeePref DRQN, using a real-world dataset, achieves a 17% increase in prefetching accuracy and a 28% increase in prefetching coverage on average compared to baseline approaches that use video content popularity as a building block to statically or dynamically make prefetching decisions. We also study the possibility of transfer learning of statistical models from one edge network into another, where unseen user requests from unknown distribution are observed. In terms of transfer learning, the increases in prefetching accuracy and prefetching coverage are 30% and 10%, respectively. Our source code will be available on GitHub.  ( 3 min )
    Non-Stationary Contextual Bandit Learning via Neural Predictive Ensemble Sampling. (arXiv:2310.07786v1 [cs.LG])
    Real-world applications of contextual bandits often exhibit non-stationarity due to seasonality, serendipity, and evolving social trends. While a number of non-stationary contextual bandit learning algorithms have been proposed in the literature, they excessively explore due to a lack of prioritization for information of enduring value, or are designed in ways that do not scale in modern applications with high-dimensional user-specific features and large action set, or both. In this paper, we introduce a novel non-stationary contextual bandit algorithm that addresses these concerns. It combines a scalable, deep-neural-network-based architecture with a carefully designed exploration mechanism that strategically prioritizes collecting information with the most lasting value in a non-stationary environment. Through empirical evaluations on two real-world recommendation datasets, which exhibit pronounced non-stationarity, we demonstrate that our approach significantly outperforms the state-of-the-art baselines.  ( 2 min )
    Multi-Scale Spatial-Temporal Recurrent Networks for Traffic Flow Prediction. (arXiv:2310.08138v1 [cs.LG])
    Traffic flow prediction is one of the most fundamental tasks of intelligent transportation systems. The complex and dynamic spatial-temporal dependencies make the traffic flow prediction quite challenging. Although existing spatial-temporal graph neural networks are prominent, they often encounter challenges such as (1) ignoring the fixed graph that limits the predictive performance of the model, (2) insufficiently capturing complex spatial-temporal dependencies simultaneously, and (3) lacking attention to spatial-temporal information at different time lengths. In this paper, we propose a Multi-Scale Spatial-Temporal Recurrent Network for traffic flow prediction, namely MSSTRN, which consists of two different recurrent neural networks: the single-step gated recurrent unit and the multi-step gated recurrent unit to fully capture the complex spatial-temporal information in the traffic data under different time windows. Moreover, we propose a spatial-temporal synchronous attention mechanism that integrates adaptive position graph convolutions into the self-attention mechanism to achieve synchronous capture of spatial-temporal dependencies. We conducted extensive experiments on four real traffic datasets and demonstrated that our model achieves the best prediction accuracy with non-trivial margins compared to all twenty baseline methods.  ( 2 min )
    Robust 1-bit Compressed Sensing with Iterative Hard Thresholding. (arXiv:2310.08019v1 [cs.IT])
    In 1-bit compressed sensing, the aim is to estimate a $k$-sparse unit vector $x\in S^{n-1}$ within an $\epsilon$ error (in $\ell_2$) from a minimal number of linear measurements that are quantized to just their signs, i.e., from measurements of the form $y = \mathrm{Sign}(\langle a, x\rangle).$ In this paper, we study a noisy version where a fraction of the measurements can be flipped, potentially by an adversary. In particular, we analyze the Binary Iterative Hard Thresholding (BIHT) algorithm, a proximal gradient descent on a properly defined loss function used for 1-bit compressed sensing, in this noisy setting. It is known from recent results that, with $\tilde{O}(\frac{k}{\epsilon})$ noiseless measurements, BIHT provides an estimate within $\epsilon$ error. This result is optimal and universal, meaning one set of measurements works for all sparse vectors. In this paper, we show that BIHT also provides better results than all known methods for the noisy setting. We show that when up to $\tau$-fraction of the sign measurements are incorrect (adversarial error), with the same number of measurements as before, BIHT agnostically provides an estimate of $x$ within an $\tilde{O}(\epsilon+\tau)$ error, maintaining the universality of measurements. This establishes stability of iterative hard thresholding in the presence of measurement error. To obtain the result, we use the restricted approximate invertibility of Gaussian matrices, as well as a tight analysis of the high-dimensional geometry of the adversarially corrupted measurements.  ( 3 min )
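    A minimal sketch of a BIHT-style iteration is given below to make the update concrete. It follows the commonly described form (gradient-like step on the sign mismatch, keep the k largest entries, renormalize); the step size, iteration count, and function names are assumptions chosen for illustration and are not taken from the paper's analysis.

```python
import numpy as np

def hard_threshold(v, k):
    """Keep the k largest-magnitude entries of v and zero out the rest."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

def biht(A, y, k, step=1.0, iters=50):
    """Estimate a k-sparse unit vector from sign measurements y ~ sign(A @ x).

    A: (m, n) measurement matrix, y: (m,) vector of +/-1 signs (some may be flipped).
    """
    m, n = A.shape
    x = np.zeros(n)
    for _ in range(iters):
        residual = y - np.sign(A @ x)              # nonzero only where signs disagree
        x = hard_threshold(x + (step / m) * (A.T @ residual), k)
        norm = np.linalg.norm(x)
        if norm > 0:
            x = x / norm                           # project back onto the unit sphere
    return x
```

    A usage example under the noisy setting would draw a Gaussian `A`, compute `y = np.sign(A @ x_true)`, flip a small fraction of the entries of `y`, and compare `biht(A, y, k)` against `x_true` in the $\ell_2$ norm.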
    Relaxing the Additivity Constraints in Decentralized No-Regret High-Dimensional Bayesian Optimization. (arXiv:2305.19838v2 [cs.LG] UPDATED)
    Bayesian Optimization (BO) is typically used to optimize an unknown function $f$ that is noisy and costly to evaluate, by exploiting an acquisition function that must be maximized at each optimization step. Even if provably asymptotically optimal BO algorithms are efficient at optimizing low-dimensional functions, scaling them to high-dimensional spaces remains an open problem, often tackled by assuming an additive structure for $f$. By doing so, BO algorithms typically introduce additional restrictive assumptions on the additive structure that reduce their applicability domain. This paper contains two main contributions: (i) we relax the restrictive assumptions on the additive structure of $f$, at the expense of weakening the maximization guarantees of the acquisition function, and (ii) we address the over-exploration problem for decentralized BO algorithms. To these ends, we propose DumBO, an asymptotically optimal decentralized BO algorithm that achieves very competitive performance against state-of-the-art BO algorithms, especially when the additive structure of $f$ comprises high-dimensional factors.  ( 2 min )
    Counterfactual Explanations for Time Series Forecasting. (arXiv:2310.08137v1 [cs.LG])
    Among recent developments in time series forecasting methods, deep forecasting models have gained popularity as they can utilize hidden feature patterns in time series to improve forecasting performance. Nevertheless, the majority of current deep forecasting models are opaque, hence making it challenging to interpret the results. While counterfactual explanations have been extensively employed as a post-hoc approach for explaining classification models, their application to forecasting models remains underexplored. In this paper, we formulate the novel problem of counterfactual generation for time series forecasting, and propose an algorithm, called ForecastCF, that solves the problem by applying gradient-based perturbations to the original time series. ForecastCF guides the perturbations by applying constraints to the forecasted values to obtain desired prediction outcomes. We experimentally evaluate ForecastCF using four state-of-the-art deep model architectures and compare to two baselines. Our results show that ForecastCF outperforms the baselines in terms of counterfactual validity and data manifold closeness. Overall, our findings suggest that ForecastCF can generate meaningful and relevant counterfactual explanations for various forecasting tasks.  ( 2 min )
    Language Models As Semantic Indexers. (arXiv:2310.07815v1 [cs.IR])
    Semantic identifier (ID) is an important concept in information retrieval that aims to preserve the semantics of objects such as documents and items inside their IDs. Previous studies typically adopt a two-stage pipeline to learn semantic IDs by first procuring embeddings using off-the-shelf text encoders and then deriving IDs based on the embeddings. However, each step introduces potential information loss and there is usually an inherent mismatch between the distribution of embeddings within the latent space produced by text encoders and the anticipated distribution required for semantic indexing. Nevertheless, it is non-trivial to design a method that can learn the document's semantic representations and its hierarchical structure simultaneously, given that semantic IDs are discrete and sequentially structured, and the semantic supervision is deficient. In this paper, we introduce LMINDEXER, a self-supervised framework to learn semantic IDs with a generative language model. We tackle the challenge of sequential discrete ID by introducing a semantic indexer capable of generating neural sequential discrete representations with progressive training and contrastive learning. In response to the semantic supervision deficiency, we propose to train the model with a self-supervised document reconstruction objective. The learned semantic indexer can facilitate various downstream tasks, such as recommendation and retrieval. We conduct experiments on three tasks including recommendation, product search, and document retrieval on five datasets from various domains, where LMINDEXER outperforms competitive baselines significantly and consistently.
    Federated Learning from Small Datasets. (arXiv:2110.03469v3 [cs.LG] UPDATED)
    Federated learning allows multiple parties to collaboratively train a joint model without sharing local data. This enables applications of machine learning in settings of inherently distributed, undisclosable data such as in the medical domain. In practice, joint training is usually achieved by aggregating local models, for which local training objectives have to be in expectation similar to the joint (global) objective. Often, however, local datasets are so small that local objectives differ greatly from the global objective, causing federated learning to fail. We propose a novel approach that intertwines model aggregations with permutations of local models. The permutations expose each local model to a daisy chain of local datasets, resulting in more efficient training in data-sparse domains. This enables training on extremely small local datasets, such as patient data across hospitals, while retaining the training efficiency and privacy benefits of federated learning.
    Interpretable Diffusion via Information Decomposition. (arXiv:2310.07972v1 [cs.LG])
    Denoising diffusion models enable conditional generation and density modeling of complex relationships like images and text. However, the nature of the learned relationships is opaque, making it difficult to understand precisely what relationships between words and parts of an image are captured, or to predict the effect of an intervention. We illuminate the fine-grained relationships learned by diffusion models by noticing a precise relationship between diffusion and information decomposition. Exact expressions for mutual information and conditional mutual information can be written in terms of the denoising model. Furthermore, pointwise estimates can be easily obtained as well, allowing us to ask questions about the relationships between specific images and captions. Decomposing information even further to understand which variables in a high-dimensional space carry information is a long-standing problem. For diffusion models, we show that a natural non-negative decomposition of mutual information emerges, allowing us to quantify informative relationships between words and pixels in an image. We exploit these new relations to measure the compositional understanding of diffusion models, to do unsupervised localization of objects in images, and to measure effects when selectively editing images through prompt interventions.
    ClimateBERT-NetZero: Detecting and Assessing Net Zero and Reduction Targets. (arXiv:2310.08096v1 [cs.LG])
    Public and private actors struggle to assess the vast amounts of information about sustainability commitments made by various institutions. To address this problem, we create a novel tool for automatically detecting corporate, national, and regional net zero and reduction targets in three steps. First, we introduce an expert-annotated data set with 3.5K text samples. Second, we train and release ClimateBERT-NetZero, a natural language classifier to detect whether a text contains a net zero or reduction target. Third, we showcase its analysis potential with two use cases: We first demonstrate how ClimateBERT-NetZero can be combined with conventional question-answering (Q&A) models to analyze the ambitions displayed in net zero and reduction targets. Furthermore, we employ the ClimateBERT-NetZero model on quarterly earnings call transcripts and outline how communication patterns evolve over time. Our experiments demonstrate promising pathways for extracting and analyzing net zero and emission reduction targets at scale.
    ZEST: Attention-based Zero-Shot Learning for Unseen IoT Device Classification. (arXiv:2310.08036v1 [cs.NI])
    Recent research works have proposed machine learning models for classifying IoT devices connected to a network. However, there is still a practical challenge of not having all devices (and hence their traffic) available during the training of a model. This essentially means, during the operational phase, we need to classify new devices not seen during the training phase. To address this challenge, we propose ZEST -- a ZSL (zero-shot learning) framework based on self-attention for classifying both seen and unseen devices. ZEST consists of i) a self-attention based network feature extractor, termed SANE, for extracting latent space representations of IoT traffic, ii) a generative model that trains a decoder using latent features to generate pseudo data, and iii) a supervised model that is trained on the generated pseudo data for classifying devices. We carry out extensive experiments on real IoT traffic data; our experiments demonstrate i) ZEST achieves significant improvement (in terms of accuracy) over the baselines; ii) ZEST is able to better extract meaningful representations than LSTM which has been commonly used for modeling network traffic.  ( 2 min )
    CrIBo: Self-Supervised Learning via Cross-Image Object-Level Bootstrapping. (arXiv:2310.07855v1 [cs.CV])
    Leveraging nearest neighbor retrieval for self-supervised representation learning has proven beneficial with object-centric images. However, this approach faces limitations when applied to scene-centric datasets, where multiple objects within an image are only implicitly captured in the global representation. Such global bootstrapping can lead to undesirable entanglement of object representations. Furthermore, even object-centric datasets stand to benefit from a finer-grained bootstrapping approach. In response to these challenges, we introduce a novel Cross-Image Object-Level Bootstrapping method tailored to enhance dense visual representation learning. By employing object-level nearest neighbor bootstrapping throughout the training, CrIBo emerges as a notably strong and adequate candidate for in-context learning, leveraging nearest neighbor retrieval at test time. CrIBo shows state-of-the-art performance on the latter task while being highly competitive in more standard downstream segmentation tasks. Our code and pretrained models will be publicly available upon acceptance.  ( 2 min )
    CleftGAN: Adapting A Style-Based Generative Adversarial Network To Create Images Depicting Cleft Lip Deformity. (arXiv:2310.07969v1 [cs.CV])
    A major obstacle when attempting to train a machine learning system to evaluate facial clefts is the scarcity of large datasets of high-quality, ethics board-approved patient images. In response, we have built a deep learning-based cleft lip generator designed to produce an almost unlimited number of artificial images exhibiting high-fidelity facsimiles of cleft lip with wide variation. We undertook a transfer learning protocol testing different versions of StyleGAN-ADA (a generative adversarial network image generator incorporating adaptive data augmentation (ADA)) as the base model. Training images depicting a variety of cleft deformities were pre-processed with rotation, scaling, color adjustment, and background blurring. The ADA modification of the primary algorithm permitted construction of our new generative model while requiring input of a relatively small number of training images. Adversarial training was carried out using 514 unique frontal photographs of cleft-affected faces to adapt a pre-trained model based on 70,000 normal faces. The Fréchet Inception Distance (FID) was used to measure the similarity of the newly generated facial images to the cleft training dataset, while Perceptual Path Length (PPL) and the novel Divergence Index of Severity Histograms (DISH) measures were also used to assess the performance of the image generator that we dub CleftGAN. We found that StyleGAN3 with translation invariance (StyleGAN3-t) performed optimally as a base model. Generated images achieved a low FID reflecting a close similarity to our training input dataset of genuine cleft images. Low PPL and DISH measures reflected a smooth and semantically valid interpolation of images through the transfer learning process and a similar distribution of severity in the training and generated images, respectively.  ( 3 min )
    Beyond Traditional DoE: Deep Reinforcement Learning for Optimizing Experiments in Model Identification of Battery Dynamics. (arXiv:2310.08198v1 [cs.LG])
    Model identification of battery dynamics is a central problem in energy research; many energy management systems and design processes rely on accurate battery models for efficiency optimization. The standard methodology for battery modelling is traditional design of experiments (DoE), where the battery dynamics are excited with many different current profiles and the measured outputs are used to estimate the system dynamics. However, although it is possible to obtain useful models with the traditional approach, the process is time-consuming and expensive because of the need to sweep many different current-profile configurations. In the present work, a novel DoE approach is developed based on deep reinforcement learning, which alters the configuration of the experiments on the fly based on the statistics of past experiments. Instead of sticking to a library of predefined current profiles, the proposed approach modifies the current profiles dynamically by updating the output space covered by past measurements, hence only the current profiles that are informative for future experiments are applied. Simulations and real experiments are used to show that the proposed approach gives models that are as accurate as those obtained with traditional DoE while using 85% fewer resources.  ( 2 min )
    Cost-Driven Hardware-Software Co-Optimization of Machine Learning Pipelines. (arXiv:2310.07940v1 [cs.LG])
    Researchers have long touted a vision of the future enabled by a proliferation of internet-of-things devices, including smart sensors, homes, and cities. Increasingly, embedding intelligence in such devices involves the use of deep neural networks. However, their storage and processing requirements make them prohibitive for cheap, off-the-shelf platforms. Overcoming those requirements is necessary for enabling widely-applicable smart devices. While many ways of making models smaller and more efficient have been developed, there is a lack of understanding of which ones are best suited for particular scenarios. More importantly for edge platforms, those choices cannot be analyzed in isolation from cost and user experience. In this work, we holistically explore how quantization, model scaling, and multi-modality interact with system components such as memory, sensors, and processors. We perform this hardware/software co-design from the cost, latency, and user-experience perspective, and develop a set of guidelines for optimal system design and model deployment for the most cost-constrained platforms. We demonstrate our approach using an end-to-end, on-device, biometric user authentication system using a $20 ESP-EYE board.  ( 2 min )
    CHIP: Contrastive Hierarchical Image Pretraining. (arXiv:2310.08304v1 [cs.CV])
    Few-shot object classification is the task of classifying objects in an image with a limited number of examples as supervision. We propose a one-shot/few-shot classification model that can classify an object of any unseen class into a relatively general category in a hierarchically based classification. Our model uses a three-level hierarchical contrastive loss based ResNet152 classifier for classifying an object based on its features extracted from image embedding, not used during the training phase. For our experimentation, we have used a subset of the ImageNet (ILSVRC-12) dataset that contains only the animal classes for training our model and created our own dataset of unseen classes for evaluating our trained model. Our model provides satisfactory results in classifying the unknown objects into a generic category, which is discussed later in greater detail.  ( 2 min )
    On the Computational Complexity of Private High-dimensional Model Selection via the Exponential Mechanism. (arXiv:2310.07852v1 [stat.ML])
    We consider the problem of model selection in a high-dimensional sparse linear regression model under the differential privacy framework. In particular, we consider the problem of differentially private best subset selection and study its utility guarantee. We adopt the well-known exponential mechanism for selecting the best model, and under a certain margin condition, we establish its strong model recovery property. However, the exponential search space of the exponential mechanism poses a serious computational bottleneck. To overcome this challenge, we propose a Metropolis-Hastings algorithm for the sampling step and establish its polynomial mixing time to its stationary distribution in the problem parameters $n,p$, and $s$. Furthermore, we also establish approximate differential privacy for the final estimates of the Metropolis-Hastings random walk using its mixing property. Finally, we also perform some illustrative simulations that echo the theoretical findings of our main results.  ( 2 min )
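The two ingredients named in this abstract, the exponential mechanism and a Metropolis-Hastings sampler over subsets, can be sketched on a toy problem. This is an illustration under stated assumptions, not the paper's algorithm: the utility table, the swap proposal, and the names `exponential_mechanism_probs` and `mh_step` are all hypothetical. The only standard fact used is that the exponential mechanism samples a candidate with probability proportional to exp(epsilon * utility / (2 * sensitivity)).

```python
import math
import random
from itertools import combinations
from typing import Callable, Dict, FrozenSet

Subset = FrozenSet[int]


def exponential_mechanism_probs(
    utility: Dict[Subset, float], epsilon: float, sensitivity: float
) -> Dict[Subset, float]:
    """Exact sampling probabilities, computed by enumerating every candidate."""
    top = max(utility.values())  # subtract the max for numerical stability
    weights = {
        s: math.exp(epsilon * (u - top) / (2.0 * sensitivity))
        for s, u in utility.items()
    }
    total = sum(weights.values())
    return {s: w / total for s, w in weights.items()}


def mh_step(
    current: Subset,
    p: int,
    score: Callable[[Subset], float],
    epsilon: float,
    sensitivity: float,
    rng: random.Random,
) -> Subset:
    """One Metropolis-Hastings move: swap a selected feature for an unselected one.

    The swap proposal is symmetric, so the acceptance ratio reduces to the
    ratio of (unnormalized) exponential-mechanism weights.
    """
    drop = rng.choice(sorted(current))
    add = rng.choice(sorted(set(range(p)) - current))
    proposal = (current - {drop}) | {add}
    log_ratio = epsilon * (score(proposal) - score(current)) / (2.0 * sensitivity)
    accept_prob = math.exp(min(0.0, log_ratio))
    return proposal if rng.random() < accept_prob else current


# Hypothetical setup: p = 4 features, subsets of size s = 2, and a made-up
# utility that counts how many of the "true" features {0, 1} a subset contains.
P, S, TRUE_SUPPORT = 4, 2, frozenset({0, 1})
UTILITY = {frozenset(c): float(len(TRUE_SUPPORT & set(c))) for c in combinations(range(P), S)}

probs = exponential_mechanism_probs(UTILITY, epsilon=2.0, sensitivity=1.0)
best = max(probs, key=probs.get)

rng = random.Random(0)
state = frozenset({2, 3})
for _ in range(200):
    state = mh_step(state, P, UTILITY.__getitem__, 2.0, 1.0, rng)
print(sorted(best), sorted(state))
```

Enumeration costs grow combinatorially with the number of features, which is the bottleneck the abstract refers to; the random walk only ever evaluates the utility of the current and proposed subsets.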
    The Expressive Power of Transformers with Chain of Thought. (arXiv:2310.07923v1 [cs.LG])
    Recent theoretical work has identified surprisingly simple reasoning problems, such as checking if two nodes in a graph are connected or simulating finite-state machines, that are provably unsolvable by standard transformers that answer immediately after reading their input. However, in practice, transformers' reasoning can be improved by allowing them to use a "chain of thought" or "scratchpad", i.e., generate and condition on a sequence of intermediate tokens before answering. Motivated by this, we ask: Does such intermediate generation fundamentally extend the computational power of a decoder-only transformer? We show that the answer is yes, but the amount of increase depends crucially on the amount of intermediate generation. For instance, we find that transformer decoders with a logarithmic number of decoding steps (w.r.t. the input length) push the limits of standard transformers only slightly, while a linear number of decoding steps adds a clear new ability (under standard complexity conjectures): recognizing all regular languages. Our results also imply that linear steps keep transformer decoders within context-sensitive languages, and polynomial steps make them recognize exactly the class of polynomial-time solvable problems -- the first exact characterization of a type of transformers in terms of standard complexity classes. Together, our results provide a nuanced framework for understanding how the length of a transformer's chain of thought or scratchpad impacts its reasoning power.  ( 2 min )
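As a loose analogy for the finite-state-machine example, and not the paper's construction, the sketch below writes one intermediate token per input symbol, so the scratchpad grows linearly with the input. Each step reads only the previously written token and the current symbol. The parity automaton and the name `decode_with_scratchpad` are assumptions chosen for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical DFA: accepts binary strings containing an even number of '1's.
TRANSITIONS: Dict[Tuple[str, str], str] = {
    ("even", "0"): "even",
    ("even", "1"): "odd",
    ("odd", "0"): "odd",
    ("odd", "1"): "even",
}
START = "even"
ACCEPTING = {"even"}


def decode_with_scratchpad(inputs: str) -> Tuple[List[str], bool]:
    """Emit one intermediate token (the current state) per input symbol."""
    scratchpad: List[str] = []
    for symbol in inputs:
        # Condition only on the last written token and the current symbol.
        previous = scratchpad[-1] if scratchpad else START
        scratchpad.append(TRANSITIONS[(previous, symbol)])
    final_state = scratchpad[-1] if scratchpad else START
    return scratchpad, final_state in ACCEPTING


steps, accepted = decode_with_scratchpad("1101")
print(steps, accepted)
```

The point of the analogy is only that the state never has to be recomputed from the whole prefix: once it is written down as a token, a fixed local rule suffices for the next step.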
    Emulating the dynamics of complex systems using autoregressive models on manifolds (mNARX). (arXiv:2306.16335v2 [stat.CO] UPDATED)
    We propose a novel surrogate modelling approach to efficiently and accurately approximate the response of complex dynamical systems driven by time-varying exogenous excitations over extended time periods. Our approach, namely manifold nonlinear autoregressive modelling with exogenous input (mNARX), involves constructing a problem-specific exogenous input manifold that is optimal for constructing autoregressive surrogates. The manifold, which forms the core of mNARX, is constructed incrementally by incorporating the physics of the system, as well as prior expert and domain knowledge. Because mNARX decomposes the full problem into a series of smaller sub-problems, each with a lower complexity than the original, it scales well with the complexity of the problem, both in terms of training and evaluation costs of the final surrogate. Furthermore, mNARX synergizes well with traditional dimensionality reduction techniques, making it highly suitable for modelling dynamical systems with high-dimensional exogenous inputs, a class of problems that is typically challenging to solve. Since domain knowledge is particularly abundant in physical systems, such as those found in civil and mechanical engineering, mNARX is well suited for these applications. We demonstrate that mNARX outperforms traditional autoregressive surrogates in predicting the response of a classical coupled spring-mass system excited by a one-dimensional random excitation. Additionally, we show that mNARX is well suited for emulating very high-dimensional time- and state-dependent systems, even when affected by active controllers, by surrogating the dynamics of a realistic aero-servo-elastic onshore wind turbine simulator. In general, our results demonstrate that mNARX offers promising prospects for modelling complex dynamical systems, in terms of accuracy and efficiency.  ( 3 min )
    Leader-Follower Neural Networks with Local Error Signals Inspired by Complex Collectives. (arXiv:2310.07885v1 [cs.LG])
    The collective behavior of a network with heterogeneous, resource-limited information processing units (e.g., group of fish, flock of birds, or network of neurons) demonstrates high self-organization and complexity. These emergent properties arise from simple interaction rules where certain individuals can exhibit leadership-like behavior and influence the collective activity of the group. Motivated by the intricacy of these collectives, we propose a neural network (NN) architecture inspired by the rules observed in nature's collective ensembles. This NN structure contains workers that encompass one or more information processing units (e.g., neurons, filters, layers, or blocks of layers). Workers are either leaders or followers, and we train a leader-follower neural network (LFNN) by leveraging local error signals and optionally incorporating backpropagation (BP) and global loss. We investigate worker behavior and evaluate LFNNs through extensive experimentation. Our LFNNs trained with local error signals achieve significantly lower error rates than previous BP-free algorithms on MNIST and CIFAR-10 and even surpass BP-enabled baselines. In the case of ImageNet, our LFNN-l demonstrates superior scalability and outperforms previous BP-free algorithms by a significant margin.  ( 2 min )
    Towards Causal Deep Learning for Vulnerability Detection. (arXiv:2310.07958v1 [cs.SE])
    Deep learning vulnerability detection has shown promising results in recent years. However, an important challenge that still blocks it from being very useful in practice is that the model is not robust under perturbation and it cannot generalize well over the out-of-distribution (OOD) data, e.g., applying a trained model to unseen projects in the real world. We hypothesize that this is because the model learned non-robust features, e.g., variable names, that have spurious correlations with labels. When the perturbed and OOD datasets no longer have the same spurious features, the model prediction fails. To address the challenge, in this paper, we introduce causality into deep learning vulnerability detection. Our approach CausalVul consists of two phases. First, we design novel perturbations to discover spurious features that the model may use to make predictions. Second, we apply causal learning algorithms, specifically do-calculus, on top of existing deep learning models to systematically remove the use of spurious features and thus promote causal-based prediction. Our results show that CausalVul consistently improved the model accuracy, robustness and OOD performance for all the state-of-the-art models and datasets we experimented with. To the best of our knowledge, this is the first work that introduces do-calculus-based causal learning to software engineering models and shows that it is indeed useful for improving the model accuracy, robustness and generalization. Our replication package is located at https://figshare.com/s/0ffda320dcb96c249ef2.  ( 2 min )
    D2 Pruning: Message Passing for Balancing Diversity and Difficulty in Data Pruning. (arXiv:2310.07931v1 [cs.LG])
    Analytical theories suggest that higher-quality data can lead to lower test errors in models trained on a fixed data budget. Moreover, a model can be trained on a lower compute budget without compromising performance if a dataset can be stripped of its redundancies. Coreset selection (or data pruning) seeks to select a subset of the training data so as to maximize the performance of models trained on this subset, also referred to as coreset. There are two dominant approaches: (1) geometry-based data selection for maximizing data diversity in the coreset, and (2) functions that assign difficulty scores to samples based on training dynamics. Optimizing for data diversity leads to a coreset that is biased towards easier samples, whereas, selection by difficulty ranking omits easy samples that are necessary for the training of deep learning models. This demonstrates that data diversity and importance scores are two complementary factors that need to be jointly considered during coreset selection. We represent a dataset as an undirected graph and propose a novel pruning algorithm, D2 Pruning, that uses forward and reverse message passing over this dataset graph for coreset selection. D2 Pruning updates the difficulty scores of each example by incorporating the difficulty of its neighboring examples in the dataset graph. Then, these updated difficulty scores direct a graph-based sampling method to select a coreset that encapsulates both diverse and difficult regions of the dataset space. We evaluate supervised and self-supervised versions of our method on various vision and language datasets. Results show that D2 Pruning improves coreset selection over previous state-of-the-art methods for up to 70% pruning rates. Additionally, we find that using D2 Pruning for filtering large multimodal datasets leads to increased diversity in the dataset and improved generalization of pretrained models.  ( 3 min )
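The two passes described above can be sketched as follows. This is one plausible reading of the abstract, not the authors' implementation: the k-nearest-neighbour graph, the Gaussian edge kernel, the parameters `k` and `gamma`, and the function names are all assumptions. The forward pass adds neighbours' difficulty to each example's score; the reverse pass greedily picks the highest-scoring example and then discounts its neighbours so the next pick comes from a different region.

```python
import math
from typing import Dict, List, Sequence, Set

Point = Sequence[float]


def knn_graph(points: Sequence[Point], k: int) -> Dict[int, Set[int]]:
    """Undirected graph linking each point to its k nearest neighbours."""
    neighbors: Dict[int, Set[int]] = {i: set() for i in range(len(points))}
    for i in range(len(points)):
        others = [j for j in range(len(points)) if j != i]
        others.sort(key=lambda j: math.dist(points[i], points[j]))
        for j in others[:k]:
            neighbors[i].add(j)
            neighbors[j].add(i)
    return neighbors


def select_coreset(
    points: Sequence[Point],
    difficulty: Sequence[float],
    budget: int,
    k: int = 2,
    gamma: float = 1.0,
) -> List[int]:
    graph = knn_graph(points, k)

    def weight(i: int, j: int) -> float:
        # Assumed kernel: closer neighbours exchange stronger messages.
        return math.exp(-gamma * math.dist(points[i], points[j]) ** 2)

    # Forward pass: each score absorbs the difficulty of its neighbours.
    scores = [
        difficulty[i] + sum(weight(i, j) * difficulty[j] for j in graph[i])
        for i in range(len(points))
    ]

    # Reverse pass: pick the top score, then down-weight its neighbours.
    selected: List[int] = []
    remaining = set(range(len(points)))
    while remaining and len(selected) < budget:
        best = max(remaining, key=lambda i: scores[i])
        selected.append(best)
        remaining.remove(best)
        for j in graph[best]:
            if j in remaining:
                scores[j] -= weight(best, j) * scores[best]
    return selected


# Hypothetical data: a hard cluster near the origin and an easy cluster near (5, 5).
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
difficulty = [0.9, 0.8, 0.85, 0.3, 0.2, 0.25]

coreset = select_coreset(points, difficulty, budget=2)
by_difficulty = sorted(range(len(points)), key=lambda i: -difficulty[i])[:2]
print(coreset, by_difficulty)
```

On this toy data, ranking by difficulty alone takes both picks from the hard cluster, while the message-passing variant takes one example from each cluster.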
    Promoting Robustness of Randomized Smoothing: Two Cost-Effective Approaches. (arXiv:2310.07780v1 [cs.LG])
    Randomized smoothing has recently attracted attention in the field of adversarial robustness for providing provable robustness guarantees on smoothed neural network classifiers. However, existing works show that vanilla randomized smoothing usually does not provide good robustness performance and often requires (re)training techniques on the base classifier in order to boost the robustness of the resulting smoothed classifier. In this work, we propose two cost-effective approaches to boost the robustness of randomized smoothing while preserving its clean performance. The first approach introduces a new robust training method, AdvMacer, which combines adversarial training and robustness certification maximization for randomized smoothing. We show that AdvMacer can improve the robustness performance of randomized smoothing classifiers compared to SOTA baselines, while being 3x faster to train than the MACER baseline. The second approach introduces a post-processing method, EsbRS, which greatly improves the robustness certificate based on building model ensembles. We explore different aspects of model ensembles that have not been studied by prior works and propose a novel design methodology to further improve robustness of the ensemble based on our theoretical analysis.  ( 2 min )
    First-Order Dynamic Optimization for Streaming Convex Costs. (arXiv:2310.07925v1 [math.OC])
    This paper proposes a set of novel optimization algorithms for solving a class of convex optimization problems with time-varying streaming cost functions. We develop an approach to track the optimal solution with a bounded error. Unlike the existing results, our algorithm uses only the first-order derivatives of the cost function, which makes it computationally efficient for optimization with a time-varying cost function. We compare our algorithms to the gradient descent algorithm and show why gradient descent is not an effective solution for optimization problems with time-varying cost. Several examples, including solving a model predictive control problem cast as a convex optimization problem with a streaming time-varying cost function, demonstrate our results.  ( 2 min )
    Local Graph Clustering with Noisy Labels. (arXiv:2310.08031v1 [cs.LG])
    The growing interest in machine learning problems over graphs with additional node information such as texts, images, or labels has popularized methods that require the costly operation of processing the entire graph. Yet, little effort has been devoted to developing fast local methods (i.e., without accessing the entire graph) that extract useful information from such data. To that end, we propose a study of local graph clustering using noisy node labels as a proxy for additional node information. In this setting, nodes receive initial binary labels based on cluster affiliation: 1 if they belong to the target cluster and 0 otherwise. Subsequently, a fraction of these labels is flipped. We investigate the benefits of incorporating noisy labels for local graph clustering. By constructing a weighted graph with such labels, we study the performance of a graph diffusion-based local clustering method on both the original and the weighted graphs. From a theoretical perspective, we consider recovering an unknown target cluster with a single seed node in a random graph with independent noisy node labels. We provide sufficient conditions on the label noise under which, with high probability, using diffusion in the weighted graph yields a more accurate recovery of the target cluster. This approach proves more effective than using the given labels alone or using diffusion in the label-free original graph. Empirically, we show that reliable node labels can be obtained with just a few samples from an attributed graph. Moreover, utilizing these labels via diffusion in the weighted graph leads to significantly better local clustering performance across several real-world datasets, improving F1 scores by up to 13%.  ( 3 min )
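A toy version of label-weighted diffusion, under stated assumptions: the weighting rule (boost an edge when both endpoints carry label 1), the `boost` value, and the use of personalized PageRank as the diffusion are illustrative choices, not necessarily those of the paper. The example graph has two triangles joined by a bridge, with one label flipped in the non-target triangle.

```python
from typing import Dict, List, Sequence, Tuple

Edge = Tuple[int, int]


def label_weights(
    edges: Sequence[Edge], labels: Sequence[int], boost: float = 2.0
) -> Dict[Edge, float]:
    """Assumed rule: weight 1 + boost when both endpoints are labelled 1, else 1."""
    return {(u, v): 1.0 + boost * labels[u] * labels[v] for u, v in edges}


def personalized_pagerank(
    n: int,
    weights: Dict[Edge, float],
    seed: int,
    alpha: float = 0.15,
    iterations: int = 200,
) -> List[float]:
    """Power iteration for a seeded random walk with teleport probability alpha.

    Assumes every node has at least one incident edge.
    """
    degree = [0.0] * n
    for (u, v), w in weights.items():
        degree[u] += w
        degree[v] += w
    p = [0.0] * n
    p[seed] = 1.0
    for _ in range(iterations):
        nxt = [0.0] * n
        nxt[seed] = alpha  # teleport mass returns to the seed
        for (u, v), w in weights.items():
            nxt[v] += (1.0 - alpha) * p[u] * w / degree[u]
            nxt[u] += (1.0 - alpha) * p[v] * w / degree[v]
        p = nxt
    return p


# Target cluster {0, 1, 2}; nodes {3, 4, 5} form the other triangle.
EDGES = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5), (3, 5)]
NOISY_LABELS = [1, 1, 1, 0, 0, 1]  # node 5's label has been flipped
TARGET = [0, 1, 2]

plain = personalized_pagerank(6, {e: 1.0 for e in EDGES}, seed=0)
weighted = personalized_pagerank(6, label_weights(EDGES, NOISY_LABELS), seed=0)
mass_plain = sum(plain[i] for i in TARGET)
mass_weighted = sum(weighted[i] for i in TARGET)
print(round(mass_plain, 3), round(mass_weighted, 3))
```

Because only edges inside the target triangle have both endpoints labelled 1, the walk leaves the target cluster less often on the weighted graph, so more diffusion mass stays on nodes 0 to 2 despite the flipped label.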
    A Framework for Adapting Offline Algorithms to Solve Combinatorial Multi-Armed Bandit Problems with Bandit Feedback. (arXiv:2301.13326v2 [cs.LG] UPDATED)
    We investigate the problem of stochastic, combinatorial multi-armed bandits where the learner only has access to bandit feedback and the reward function can be non-linear. We provide a general framework for adapting discrete offline approximation algorithms into sublinear $\alpha$-regret methods that only require bandit feedback, achieving $\mathcal{O}\left(T^\frac{2}{3}\log(T)^\frac{1}{3}\right)$ expected cumulative $\alpha$-regret dependence on the horizon $T$. The framework only requires the offline algorithms to be robust to small errors in function evaluation. The adaptation procedure does not even require explicit knowledge of the offline approximation algorithm -- the offline algorithm can be used as a black box subroutine. To demonstrate the utility of the proposed framework, we apply it to diverse applications in submodular maximization. The new CMAB algorithms for submodular maximization with knapsack constraints outperform a full-bandit method developed for the adversarial setting in experiments with real-world data.  ( 3 min )
    Learning to Simulate Tree-Branch Dynamics for Manipulation. (arXiv:2306.03410v2 [cs.RO] UPDATED)
    We propose to use a simulation driven inverse inference approach to model the dynamics of tree branches under manipulation. Learning branch dynamics and gaining the ability to manipulate deformable vegetation can help with occlusion-prone tasks, such as fruit picking in dense foliage, as well as moving overhanging vines and branches for navigation in dense vegetation. The underlying deformable tree geometry is encapsulated as coarse spring abstractions executed on parallel, non-differentiable simulators. The implicit statistical model defined by the simulator, reference trajectories obtained by actively probing the ground truth, and the Bayesian formalism, together guide the spring parameter posterior density estimation. Our non-parametric inference algorithm, based on Stein Variational Gradient Descent, incorporates biologically motivated assumptions into the inference process as neural network driven learnt joint priors; moreover, it leverages the finite difference scheme for gradient approximations. Real and simulated experiments confirm that our model can predict deformation trajectories, quantify the estimation uncertainty, and it can perform better when base-lined against other inference algorithms, particularly from the Monte Carlo family. The model displays strong robustness properties in the presence of heteroscedastic sensor noise; furthermore, it can generalise to unseen grasp locations.  ( 2 min )
    A Transfer-Learning-Based Prognosis Prediction Paradigm that Bridges Data Distribution Shift across EMR Datasets. (arXiv:2310.07799v1 [cs.LG])
    Because information about emerging diseases is limited, symptoms are hard to notice and recognize, so the window for clinical intervention can be missed. An effective prognostic model is expected to assist doctors in making the right diagnosis and designing a personalized treatment plan, so as to promptly prevent unfavorable outcomes. However, in the early stage of a disease, limited data collection and clinical experience, together with privacy and ethical concerns, may restrict the data available for reference, to the extent that even data labels are difficult to mark correctly. In addition, Electronic Medical Record (EMR) data for different diseases, or from different sources for the same disease, can exhibit serious cross-dataset feature misalignment, greatly impairing the efficiency of deep learning models. This article introduces a transfer learning method to build a transition model from a source dataset to a target dataset. By constraining the distribution shift of features generated in disparate domains, domain-invariant features that are exclusively relevant to downstream tasks are captured, yielding a unified domain-invariant encoder across various task domains that achieves better feature representation. Experimental results on several target tasks demonstrate that our proposed model outperforms competing baseline methods and converges faster in training, especially when data are limited. Numerous experiments have demonstrated the efficacy of our method in providing more accurate predictions concerning newly emergent pandemics and other diseases.  ( 3 min )
    Elastic Decision Transformer. (arXiv:2307.02484v5 [cs.LG] UPDATED)
    This paper introduces Elastic Decision Transformer (EDT), a significant advancement over the existing Decision Transformer (DT) and its variants. Although DT purports to generate an optimal trajectory, empirical evidence suggests it struggles with trajectory stitching, a process involving the generation of an optimal or near-optimal trajectory from the best parts of a set of sub-optimal trajectories. The proposed EDT differentiates itself by facilitating trajectory stitching during action inference at test time, achieved by adjusting the history length maintained in DT. Further, the EDT optimizes the trajectory by retaining a longer history when the previous trajectory is optimal and a shorter one when it is sub-optimal, enabling it to "stitch" with a more optimal trajectory. Extensive experimentation demonstrates EDT's ability to bridge the performance gap between DT-based and Q Learning-based approaches. In particular, the EDT outperforms Q Learning-based methods in a multi-task regime on the D4RL locomotion benchmark and Atari games. Videos are available at: https://kristery.github.io/edt/  ( 2 min )
    Limits of Model Selection under Transfer Learning. (arXiv:2305.00152v4 [stat.ML] UPDATED)
    Theoretical studies on transfer learning or domain adaptation have so far focused on situations with a known hypothesis class or model; however in practice, some amount of model selection is usually involved, often appearing under the umbrella term of hyperparameter-tuning: for example, one may think of the problem of tuning for the right neural network architecture towards a target task, while leveraging data from a related source task. Now, in addition to the usual tradeoffs on approximation vs estimation errors involved in model selection, this problem brings in a new complexity term, namely, the transfer distance between source and target distributions, which is known to vary with the choice of hypothesis class. We present a first study of this problem, focusing on classification; in particular, the analysis reveals some remarkable phenomena: adaptive rates, i.e., those achievable with no distributional information, can be arbitrarily slower than oracle rates, i.e., when given knowledge on distances.  ( 2 min )
    GROOT: Learning to Follow Instructions by Watching Gameplay Videos. (arXiv:2310.08235v1 [cs.AI])
    We study the problem of building a controller that can follow open-ended instructions in open-world environments. We propose to follow reference videos as instructions, which offer expressive goal specifications while eliminating the need for expensive text-gameplay annotations. A new learning framework is derived to allow learning such instruction-following controllers from gameplay videos while producing a video instruction encoder that induces a structured goal space. We implement our agent GROOT in a simple yet effective encoder-decoder architecture based on causal transformers. We evaluate GROOT against open-world counterparts and human players on a proposed Minecraft SkillForge benchmark. The Elo ratings clearly show that GROOT is closing the human-machine gap as well as exhibiting a 70% winning rate over the best generalist agent baseline. Qualitative analysis of the induced goal space further demonstrates some interesting emergent properties, including the goal composition and complex gameplay behavior synthesis. Code and video can be found on the website https://craftjarvis-groot.github.io.  ( 2 min )
    XIMAGENET-12: An Explainable AI Benchmark Dataset for Model Robustness Evaluation. (arXiv:2310.08182v1 [cs.CV])
    The lack of standardized robustness metrics and the widespread reliance on numerous unrelated benchmark datasets for testing have created a gap between academically validated robust models and their often problematic practical adoption. To address this, we introduce XIMAGENET-12, an explainable benchmark dataset with over 200K images and 15,600 manual semantic annotations. The dataset covers 12 categories from ImageNet to represent objects commonly encountered in practical life and simulates six diverse scenarios, including overexposure, blurring, color changing, etc. We further propose a novel robustness criterion that extends beyond model generation ability assessment. This benchmark dataset, along with related code, is available at https://sites.google.com/view/ximagenet-12/home. Researchers and practitioners can leverage this resource to evaluate the robustness of their visual models under challenging conditions and ultimately better meet the demands of practical computer vision systems.  ( 2 min )
    Samples on Thin Ice: Re-Evaluating Adversarial Pruning of Neural Networks. (arXiv:2310.08073v1 [cs.LG])
    Neural network pruning has been shown to be an effective technique for reducing the network size, trading desirable properties like generalization and robustness to adversarial attacks for higher sparsity. Recent work has claimed that adversarial pruning methods can produce sparse networks while also preserving robustness to adversarial examples. In this work, we first re-evaluate three state-of-the-art adversarial pruning methods, showing that their robustness was indeed overestimated. We then compare pruned and dense versions of the same models, discovering that samples on thin ice, i.e., closer to the unpruned model's decision boundary, are typically misclassified after pruning. We conclude by discussing how this intuition may lead to designing more effective adversarial pruning methods in future work.  ( 2 min )
    Data driven modeling of self-similar dynamics. (arXiv:2310.08282v1 [cs.LG])
    Multiscale modeling of complex systems is crucial for understanding their intricacies. Data-driven multiscale modeling has emerged as a promising approach to tackle challenges associated with complex systems. On the other hand, self-similarity is prevalent in complex systems, hinting that large-scale complex systems can be modeled at a reduced cost. In this paper, we introduce a multiscale neural network framework that incorporates self-similarity as prior knowledge, facilitating the modeling of self-similar dynamical systems. For deterministic dynamics, our framework can discern whether the dynamics are self-similar. For uncertain dynamics, it can compare and determine which parameter set is closer to self-similarity. The framework allows us to extract scale-invariant kernels from the dynamics for modeling at any scale. Moreover, our method can identify the power law exponents in self-similar systems. Preliminary tests on the Ising model yielded critical exponents consistent with theoretical expectations, providing valuable insights for addressing critical phase transitions in non-equilibrium systems.  ( 2 min )
    To token or not to token: A Comparative Study of Text Representations for Cross-Lingual Transfer. (arXiv:2310.08078v1 [cs.CL])
    Choosing an appropriate tokenization scheme is often a bottleneck in low-resource cross-lingual transfer. To understand the downstream implications of text representation choices, we perform a comparative analysis on language models having diverse text representation modalities including 2 segmentation-based models (BERT, mBERT), 1 image-based model (PIXEL), and 1 character-level model (CANINE). First, we propose a scoring Language Quotient (LQ) metric capable of providing a weighted representation of both zero-shot and few-shot evaluation combined. Utilizing this metric, we perform experiments comprising 19 source languages and 133 target languages on three tasks (POS tagging, Dependency parsing, and NER). Our analysis reveals that image-based models excel in cross-lingual transfer when languages are closely related and share visually similar scripts. However, for tasks biased toward word meaning (POS, NER), segmentation-based models prove to be superior. Furthermore, in dependency parsing tasks where word relationships play a crucial role, character-level models outperform the others. Finally, we propose a recommendation scheme based on our findings to guide model selection according to task and language requirements.  ( 2 min )
    Learning from Label Proportions: Bootstrapping Supervised Learners via Belief Propagation. (arXiv:2310.08056v1 [cs.LG])
    Learning from Label Proportions (LLP) is a learning problem where only aggregate level labels are available for groups of instances, called bags, during training, and the aim is to get the best performance at the instance-level on the test data. This setting arises in domains like advertising and medicine due to privacy considerations. We propose a novel algorithmic framework for this problem that iteratively performs two main steps. For the first step (Pseudo Labeling) in every iteration, we define a Gibbs distribution over binary instance labels that incorporates a) covariate information through the constraint that instances with similar covariates should have similar labels and b) the bag level aggregated label. We then use Belief Propagation (BP) to marginalize the Gibbs distribution to obtain pseudo labels. In the second step (Embedding Refinement), we use the pseudo labels to provide supervision for a learner that yields a better embedding. Further, we iterate on the two steps again by using the second step's embeddings as new covariates for the next iteration. In the final iteration, a classifier is trained using the pseudo labels. Our algorithm displays strong gains against several SOTA baselines (up to 15%) for the LLP Binary Classification problem on various dataset types - tabular and Image. We achieve these improvements with minimal computational overhead above standard supervised learning due to Belief Propagation, for large bag sizes, even for a million samples.  ( 2 min )
    Overview of Physics-Informed Machine Learning Inversion of Geophysical Data. (arXiv:2310.08109v1 [physics.geo-ph])
    We review four types of algorithms for physics-informed machine learning (PIML) inversion of geophysical data. The unifying equation is given by the joint objective function $\epsilon$: $$\epsilon^{||-PIML} = \lambda_1 \overbrace{||{\bf W}^{ML}({\bf H}_{\bf w} {\bf d}^{obs}-{\bf m})||^2}^{NN} + \lambda_2 \overbrace{||{\bf W}^{FWI}({\bf L} {\bf m}-{\bf d}^{obs})||^2}^{FWI} + \text{Regularizer},$$ where the optimal model ${\bf m}^*$ and weights ${\bf w}^*$ minimize $\epsilon$. Here, the matrix weights are given by the boldface symbol ${\bf W}$, and full waveform inversion (FWI) is typically computed using a finite-difference solution of the wave equation, where ${\bf L}$ represents the forward modeling operation of the wave equation as a function of the model ${\bf m}$. Also, a fully-connected neural network (NN) is used to compute the model ${\bf H}_{\bf w}{\bf d}^{obs} \approx {\bf m}$ from the observed input data ${\bf d}^{obs}$. The selection of weights $\lambda_i$ and the NN operations determine one of four different PIML algorithms. PIML offers potential advantages over standard FWI through its enhanced ability to avoid local minima and the option to locally train the inversion operator, minimizing the requirement for extensive training data for global applicability. However, the effectiveness of PIML relies on the similarity between the test and trained data. Nevertheless, a possible strategy to overcome this limitation involves initial pretraining of a PIML architecture with data from a broader region, followed by fine-tuning for specific data, a method reminiscent of the way large language models are pretrained and adapted for various tasks.  ( 2 min )
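    As a rough illustration of the joint objective above, the following sketch evaluates it for a toy problem. It rests on strong simplifying assumptions that are not in the abstract: the neural network ${\bf H}_{\bf w}$ and the wave-equation operator ${\bf L}$ are both replaced by small dense matrices, the weight matrices ${\bf W}^{ML}$ and ${\bf W}^{FWI}$ are taken to be the identity, and the regularizer is a plain number. The function names are hypothetical.

```python
def matvec(A, x):
    """Multiply a dense matrix (list of rows) by a vector."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def sq_norm(v):
    """Squared Euclidean norm of a vector."""
    return sum(vi * vi for vi in v)


def piml_objective(m, d_obs, H, L, lam1, lam2, reg=0.0):
    """Toy joint objective: lam1*||H d_obs - m||^2 + lam2*||L m - d_obs||^2 + reg.

    Assumptions (illustrative only): H stands in for the trained network H_w,
    L stands in for wave-equation forward modeling, and both weight matrices
    W^ML and W^FWI are the identity.
    """
    nn_residual = [p - mi for p, mi in zip(matvec(H, d_obs), m)]
    fwi_residual = [p - di for p, di in zip(matvec(L, m), d_obs)]
    return lam1 * sq_norm(nn_residual) + lam2 * sq_norm(fwi_residual) + reg


# Toy operators: L doubles the model, H halves the data (so H inverts L).
L = [[2.0, 0.0], [0.0, 2.0]]
H = [[0.5, 0.0], [0.0, 0.5]]
d_obs = [2.0, 4.0]

exact = piml_objective([1.0, 2.0], d_obs, H, L, lam1=1.0, lam2=0.5)
perturbed = piml_objective([1.0, 3.0], d_obs, H, L, lam1=1.0, lam2=0.5)
```

    Setting `lam1` or `lam2` to zero recovers a purely data-driven or a purely FWI-style misfit in this toy setting, which mirrors how the choice of $\lambda_i$ selects among algorithm variants.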
    LGL-BCI: A Lightweight Geometric Learning Framework for Motor Imagery-Based Brain-Computer Interfaces. (arXiv:2310.08051v1 [cs.LG])
    Brain-Computer Interfaces (BCIs) are a groundbreaking technology for interacting with external devices using brain signals. Despite advancements, electroencephalogram (EEG)-based Motor Imagery (MI) tasks face challenges like amplitude and phase variability, and complex spatial correlations, with a need for smaller model size and faster inference. This study introduces the LGL-BCI framework, employing a Geometric Deep Learning Framework for EEG processing in non-Euclidean metric spaces, particularly the Symmetric Positive Definite (SPD) Manifold space. LGL-BCI offers robust EEG data representation and captures spatial correlations. We propose an EEG channel selection solution via a feature decomposition algorithm to reduce SPD matrix dimensionality, with a lossless transformation boosting inference speed. Extensive experiments show LGL-BCI's superior accuracy and efficiency compared to current solutions, highlighting geometric deep learning's potential in MI-BCI applications. The efficiency, assessed on two public EEG datasets and two real-world EEG devices, significantly outperforms the state-of-the-art solution in accuracy ($82.54\%$ versus $62.22\%$) with fewer parameters (64.9M compared to 183.7M).  ( 2 min )
    SimCKP: Simple Contrastive Learning of Keyphrase Representations. (arXiv:2310.08221v1 [cs.CL])
    Keyphrase generation (KG) aims to generate a set of summarizing words or phrases given a source document, while keyphrase extraction (KE) aims to identify them from the text. Because the search space is much smaller in KE, it is often combined with KG to predict keyphrases that may or may not exist in the corresponding document. However, current unified approaches adopt sequence labeling and maximization-based generation that primarily operate at a token level, falling short in observing and scoring keyphrases as a whole. In this work, we propose SimCKP, a simple contrastive learning framework that consists of two stages: 1) An extractor-generator that extracts keyphrases by learning context-aware phrase-level representations in a contrastive manner while also generating keyphrases that do not appear in the document; 2) A reranker that adapts scores for each generated phrase by likewise aligning their representations with the corresponding document. Experimental results on multiple benchmark datasets demonstrate the effectiveness of our proposed approach, which outperforms the state-of-the-art models by a significant margin.  ( 2 min )
    Multi-SpacePhish: Extending the Evasion-space of Adversarial Attacks against Phishing Website Detectors using Machine Learning. (arXiv:2210.13660v3 [cs.CR] UPDATED)
    Existing literature on adversarial Machine Learning (ML) focuses either on showing attacks that break every ML model, or defenses that withstand most attacks. Unfortunately, little consideration is given to the actual feasibility of the attack or the defense. Moreover, adversarial samples are often crafted in the "feature-space", making the corresponding evaluations of questionable value. Simply put, the current situation does not allow one to estimate the actual threat posed by adversarial attacks, leading to a lack of secure ML systems. We aim to clarify such confusion in this paper. By considering the application of ML for Phishing Website Detection (PWD), we formalize the "evasion-space" in which an adversarial perturbation can be introduced to fool an ML-PWD -- demonstrating that even perturbations in the "feature-space" are useful. Then, we propose a realistic threat model describing evasion attacks against ML-PWD that are cheap to stage, and hence intrinsically more attractive for real phishers. After that, we perform the first statistically validated assessment of state-of-the-art ML-PWD against 12 evasion attacks. Our evaluation shows (i) the true efficacy of evasion attempts that are more likely to occur; and (ii) the impact of perturbations crafted in different evasion-spaces. Our realistic evasion attempts induce a statistically significant degradation (3-10% at p<0.05), and their cheap cost makes them a subtle threat. Notably, however, some ML-PWD are immune to our most realistic attacks (p=0.22). Finally, as an additional contribution of this journal publication, we are the first to consider the intriguing case wherein an attacker introduces perturbations in multiple evasion-spaces at the same time. These new results show that simultaneously applying perturbations in the problem- and feature-space can cause a drop in the detection rate from 0.95 to 0.  ( 3 min )
    PRiSM: Enhancing Low-Resource Document-Level Relation Extraction with Relation-Aware Score Calibration. (arXiv:2309.13869v1 [cs.CL] CROSS LISTED)
    Document-level relation extraction (DocRE) aims to extract relations of all entity pairs in a document. A key challenge in DocRE is the cost of annotating such data which requires intensive human effort. Thus, we investigate the case of DocRE in a low-resource setting, and we find that existing models trained on low data overestimate the NA ("no relation") label, causing limited performance. In this work, we approach the problem from a calibration perspective and propose PRiSM, which learns to adapt logits based on relation semantic information. We evaluate our method on three DocRE datasets and demonstrate that integrating existing models with PRiSM improves performance by as much as 26.38 F1 score, while the calibration error drops as much as 36 times when trained with about 3% of data. The code is publicly available at https://github.com/brightjade/PRiSM.  ( 2 min )
    Model-Free, Regret-Optimal Best Policy Identification in Online CMDPs. (arXiv:2309.15395v2 [cs.LG] UPDATED)
    This paper considers the best policy identification (BPI) problem in online Constrained Markov Decision Processes (CMDPs). We are interested in algorithms that are model-free, have low regret, and identify an optimal policy with a high probability. Existing model-free algorithms for online CMDPs with sublinear regret and constraint violation do not provide any convergence guarantee to an optimal policy and provide only average performance guarantees when a policy is uniformly sampled at random from all previously used policies. In this paper, we develop a new algorithm, named Pruning-Refinement-Identification (PRI), based on a fundamental structural property of CMDPs we discover, called limited stochasticity. The property states that for a CMDP with $N$ constraints, there exists an optimal policy with at most $N$ stochastic decisions. The proposed algorithm first identifies at which step and in which state a stochastic decision has to be taken and then fine-tunes the distributions of these stochastic decisions. PRI achieves three objectives: (i) it is a model-free algorithm; (ii) it outputs a near-optimal policy with a high probability at the end of learning; and (iii) in the tabular setting, it guarantees $\tilde{\mathcal{O}}(\sqrt{K})$ regret and constraint violation, which significantly improves the best existing regret bound $\tilde{\mathcal{O}}(K^{\frac{4}{5}})$ under a model-free algorithm, where $K$ is the total number of episodes.  ( 2 min )
    Rethinking the BERT-like Pretraining for DNA Sequences. (arXiv:2310.07644v2 [cs.AI] UPDATED)
    With the success of large-scale pretraining in NLP, there is an increasing trend of applying it to the domain of life sciences. In particular, pretraining methods based on DNA sequences have garnered growing attention due to their potential to capture generic information about genes. However, existing pretraining methods for DNA sequences largely rely on direct adoptions of BERT pretraining from NLP, lacking a comprehensive understanding and a specifically tailored approach. To address this research gap, we first conducted a series of exploratory experiments and gained several insightful observations: 1) In the fine-tuning phase of downstream tasks, when using K-mer overlapping tokenization instead of K-mer non-overlapping tokenization, both overlapping and non-overlapping pretraining weights show consistent performance improvement. 2) During the pre-training process, using K-mer overlapping tokenization quickly produces clear K-mer embeddings and reduces the loss to a very low level, while using K-mer non-overlapping tokenization results in less distinct embeddings and a loss that keeps decreasing. 3) Using overlapping tokenization causes the self-attention in the intermediate layers of pre-trained models to focus excessively on certain tokens, reflecting that these layers are not adequately optimized. In summary, overlapping tokenization can benefit the fine-tuning of downstream tasks but leads to inadequate pretraining with fast convergence. To unleash the pretraining potential, we introduce a novel approach called RandomMask, which gradually increases the task difficulty of BERT-like pretraining by continuously expanding its mask boundary, forcing the model to learn more knowledge. RandomMask is simple but effective, achieving top-tier performance on 26 of 28 datasets spanning 7 downstream tasks.  ( 3 min )
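    The difference between the two tokenization schemes discussed above can be made concrete with a small sketch. This is an illustration only, not the authors' code: the function name `kmer_tokenize` is hypothetical, and dropping a trailing remainder shorter than K in the non-overlapping case is an assumption.

```python
def kmer_tokenize(seq, k, overlapping=True):
    """Split a DNA string into K-mers.

    overlapping=True  -> stride 1 (adjacent tokens share k-1 characters)
    overlapping=False -> stride k (tokens are disjoint; a trailing remainder
                         shorter than k is dropped, which is an assumption)
    """
    step = 1 if overlapping else k
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, step)]


overlap_tokens = kmer_tokenize("ATGCGT", k=3, overlapping=True)
disjoint_tokens = kmer_tokenize("ATGCGT", k=3, overlapping=False)
```

    With stride 1, each masked token is largely determined by its neighbours, since adjacent tokens share K-1 characters. This is one plausible reading of why the abstract reports fast convergence to a very low loss under overlapping tokenization.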
    Memorization Capacity of Multi-Head Attention in Transformers. (arXiv:2306.02010v2 [cs.LG] UPDATED)
    Transformers have become the go-to architecture for language and vision tasks, yet their theoretical properties, especially memorization capacity, remain elusive. This paper investigates the memorization abilities of multi-head attention mechanisms, examining how many example sequences they can memorize, as a function of the number of heads and sequence length. Motivated by experimental findings on vision transformers, we introduce novel assumptions about the linear independence of input data, distinct from the commonly used general-position assumption. Under these assumptions, we demonstrate that an attention layer with $H$ heads, dimension $d$, and context size $n < d$, featuring $\Theta(Hd^2)$ parameters, can memorize $\Omega(Hn)$ examples. Our analysis sheds light on how different attention heads handle various example sequences, aided by the softmax operator's saturation property. We validate our findings through experiments on synthetic data.  ( 2 min )
    CRITERIA: a New Benchmarking Paradigm for Evaluating Trajectory Prediction Models for Autonomous Driving. (arXiv:2310.07794v1 [cs.CV])
    Benchmarking is a common method for evaluating trajectory prediction models for autonomous driving. Existing benchmarks rely on datasets, which are biased towards more common scenarios, such as cruising, and distance-based metrics that are computed by averaging over all scenarios. Following such a regimen provides little insight into the properties of the models, both in terms of how well they can handle different scenarios and how admissible and diverse their outputs are. There exist a number of complementary metrics designed to measure the admissibility and diversity of trajectories; however, they suffer from biases, such as the length of trajectories. In this paper, we propose a new benChmarking paRadIgm for evaluaTing trajEctoRy predIction Approaches (CRITERIA). Particularly, we propose 1) a method for extracting driving scenarios at varying levels of specificity according to the structure of the roads, models' performance, and data properties for fine-grained ranking of prediction models; 2) a set of new bias-free metrics for measuring diversity, by incorporating the characteristics of a given scenario, and admissibility, by considering the structure of roads and kinematic compliance, motivated by real-world driving constraints. 3) Using the proposed benchmark, we conduct extensive experimentation on a representative set of the prediction models using the large scale Argoverse dataset. We show that the proposed benchmark can produce a more accurate ranking of the models and serve as a means of characterizing their behavior. We further present ablation studies to highlight contributions of different elements that are used to compute the proposed metrics.  ( 3 min )
    Dynamic Subgoal-based Exploration via Bayesian Optimization. (arXiv:1910.09143v5 [math.OC] UPDATED)
    Reinforcement learning in sparse-reward navigation environments with expensive and limited interactions is challenging and poses a need for effective exploration. Motivated by complex navigation tasks that require real-world training (when cheap simulators are not available), we consider an agent that faces an unknown distribution of environments and must decide on an exploration strategy. It may leverage a series of training environments to improve its policy before it is evaluated in a test environment drawn from the same environment distribution. Most existing approaches focus on fixed exploration strategies, while the few that view exploration as a meta-optimization problem tend to ignore the need for cost-efficient exploration. We propose a cost-aware Bayesian optimization approach that efficiently searches over a class of dynamic subgoal-based exploration strategies. The algorithm adjusts a variety of levers -- the locations of the subgoals, the length of each episode, and the number of replications per trial -- in order to overcome the challenges of sparse rewards, expensive interactions, and noise. An experimental evaluation demonstrates that the new approach outperforms existing baselines across a number of problem domains. We also provide a theoretical foundation and prove that the method asymptotically identifies a near-optimal subgoal design.  ( 2 min )
    Automatic Intrinsic Reward Shaping for Exploration in Deep Reinforcement Learning. (arXiv:2301.10886v5 [cs.LG] UPDATED)
    We present AIRS: Automatic Intrinsic Reward Shaping that intelligently and adaptively provides high-quality intrinsic rewards to enhance exploration in reinforcement learning (RL). More specifically, AIRS selects a shaping function from a predefined set based on the estimated task return in real-time, providing reliable exploration incentives and alleviating the biased objective problem. Moreover, we develop an intrinsic reward toolkit to provide efficient and reliable implementations of diverse intrinsic reward approaches. We test AIRS on various tasks of MiniGrid, Procgen, and DeepMind Control Suite. Extensive simulation demonstrates that AIRS can outperform the benchmarking schemes and achieve superior performance with a simple architecture.  ( 2 min )
    Refined Mechanism Design for Approximately Structured Priors via Active Regression. (arXiv:2310.07874v1 [cs.GT])
    We consider the problem of a revenue-maximizing seller with a large number of items $m$ for sale to $n$ strategic bidders, whose valuations are drawn independently from high-dimensional, unknown prior distributions. It is well-known that optimal and even approximately-optimal mechanisms for this setting are notoriously difficult to characterize or compute, and, even when they can be found, are often rife with various counter-intuitive properties. In this paper, following a model introduced recently by Cai and Daskalakis~\cite{cai2022recommender}, we consider the case that bidders' prior distributions can be well-approximated by a topic model. We design an active learning component, responsible for interacting with the bidders and outputting low-dimensional approximations of their types, and a mechanism design component, responsible for robustifying mechanisms for the low-dimensional model to work for the approximate types of the former component. On the active learning front, we cast our problem in the framework of Randomized Linear Algebra (RLA) for regression problems, allowing us to import several breakthrough results from that line of research, and adapt them to our setting. On the mechanism design front, we remove many restrictive assumptions of prior work on the type of access needed to the underlying distributions and the associated mechanisms. To the best of our knowledge, our work is the first to formulate connections between mechanism design, and RLA for active learning of regression problems, opening the door for further applications of randomized linear algebra primitives to mechanism design.  ( 3 min )
    SEE-OoD: Supervised Exploration For Enhanced Out-of-Distribution Detection. (arXiv:2310.08040v1 [cs.LG])
    Current techniques for Out-of-Distribution (OoD) detection predominantly rely on quantifying predictive uncertainty and incorporating model regularization during the training phase, using either real or synthetic OoD samples. However, methods that utilize real OoD samples lack exploration and are prone to overfitting the OoD samples at hand. Synthetic samples, in contrast, are often generated based on features extracted from training data, rendering them less effective when the training and OoD data overlap heavily in the feature space. In this work, we propose a Wasserstein-score-based generative adversarial training scheme to enhance OoD detection accuracy, which, for the first time, performs data augmentation and exploration simultaneously under the supervision of limited OoD samples. Specifically, the generator explores OoD spaces and generates synthetic OoD samples using feedback from the discriminator, while the discriminator exploits both the observed and synthesized samples for OoD detection using a predefined Wasserstein score. We provide theoretical guarantees that the optimal solutions of our generative scheme are statistically achievable through adversarial training in empirical settings. We then demonstrate that the proposed method outperforms state-of-the-art techniques on various computer vision datasets and exhibits superior generalizability to unseen OoD data.  ( 2 min )
    Open-Set Knowledge-Based Visual Question Answering with Inference Paths. (arXiv:2310.08148v1 [cs.LG])
    Given an image and an associated textual question, the purpose of Knowledge-Based Visual Question Answering (KB-VQA) is to provide a correct answer to the question with the aid of external knowledge bases. Prior KB-VQA models are usually formulated as a retriever-classifier framework, where a pre-trained retriever extracts textual or visual information from knowledge graphs and then makes a prediction among the candidates. Despite promising progress, there are two drawbacks with existing models. Firstly, modeling question-answering as multi-class classification limits the answer space to a preset corpus and lacks the ability of flexible reasoning. Secondly, the classifier merely considers "what is the answer" without "how to get the answer", and so cannot ground the answer in explicit reasoning paths. In this paper, we confront the challenge of explainable open-set KB-VQA, where the system is required to answer questions with entities in the wild and retain an explainable reasoning path. To resolve the aforementioned issues, we propose a new retriever-ranker paradigm of KB-VQA, Graph pATH rankER (GATHER for brevity). Specifically, it contains graph constructing, pruning, and path-level ranking, which not only retrieves accurate answers but also provides inference paths that explain the reasoning process. To comprehensively evaluate our model, we reformulate the benchmark dataset OK-VQA with manually corrected entity-level annotations and release it as ConceptVQA. Extensive experiments on real-world questions demonstrate that our framework is able not only to perform open-set question answering across the whole knowledge base but also to provide explicit reasoning paths.  ( 2 min )
    Variational Imbalanced Regression: Fair Uncertainty Quantification via Probabilistic Smoothing. (arXiv:2306.06599v4 [cs.LG] UPDATED)
    Existing regression models tend to fall short in both accuracy and uncertainty estimation when the label distribution is imbalanced. In this paper, we propose a probabilistic deep learning model, dubbed variational imbalanced regression (VIR), which not only performs well in imbalanced regression but naturally produces reasonable uncertainty estimation as a byproduct. Different from typical variational autoencoders assuming I.I.D. representations (a data point's representation is not directly affected by other data points), our VIR borrows data with similar regression labels to compute the latent representation's variational distribution; furthermore, different from deterministic regression models producing point estimates, VIR predicts the entire normal-inverse-gamma distributions and modulates the associated conjugate distributions to impose probabilistic reweighting on the imbalanced data, thereby providing better uncertainty estimation. Experiments in several real-world datasets show that our VIR can outperform state-of-the-art imbalanced regression models in terms of both accuracy and uncertainty estimation. Code will soon be available at \url{https://github.com/Wang-ML-Lab/variational-imbalanced-regression}.  ( 2 min )
    A Generic Software Framework for Distributed Topological Analysis Pipelines. (arXiv:2310.08339v1 [cs.DC])
    This system paper presents a software framework for the support of topological analysis pipelines in a distributed-memory model. While several recent papers introduced topology-based approaches for distributed-memory environments, these reported experiments obtained with tailored, mono-algorithm implementations. In contrast, we describe in this paper a general-purpose, generic framework for topological analysis pipelines, i.e. a sequence of topological algorithms interacting together, possibly on distinct numbers of processes. Specifically, we instantiated our framework with the MPI model, within the Topology ToolKit (TTK). While developing this framework, we faced several algorithmic and software engineering challenges, which we document in this paper. We provide a taxonomy for the distributed-memory topological algorithms supported by TTK, depending on their communication needs, and provide examples of hybrid MPI+thread parallelizations. Detailed performance analyses show that parallel efficiencies range from $20\%$ to $80\%$ (depending on the algorithms), and that the MPI-specific preconditioning introduced by our framework induces a negligible computation time overhead. We illustrate the new distributed-memory capabilities of TTK with an example of an advanced analysis pipeline, combining multiple algorithms, run on the largest publicly available dataset we have found (120 billion vertices) on a standard cluster with 64 nodes (for a total of 1,536 cores). Finally, we provide a roadmap for the completion of TTK's MPI extension, along with generic recommendations for each algorithm communication category.  ( 3 min )
    Trustworthy Machine Learning. (arXiv:2310.08215v1 [cs.LG])
    As machine learning technology gets applied to actual products and solutions, new challenges have emerged. Models unexpectedly fail to generalize to small changes in the distribution, tend to be confident on novel data they have never seen, or cannot communicate the rationale behind their decisions effectively with the end users. Collectively, we face a trustworthiness issue with the current machine learning technology. This textbook on Trustworthy Machine Learning (TML) covers a theoretical and technical background of four key topics in TML: Out-of-Distribution Generalization, Explainability, Uncertainty Quantification, and Evaluation of Trustworthiness. We discuss important classical and contemporary research papers of the aforementioned fields and uncover and connect their underlying intuitions. The book evolved from the homonymous course at the University of T\"ubingen, first offered in the Winter Semester of 2022/23. It is meant to be a stand-alone product accompanied by code snippets and various pointers to further sources on topics of TML. The dedicated website of the book is https://trustworthyml.io/.  ( 2 min )
    Towards a Unified Analysis of Kernel-based Methods Under Covariate Shift. (arXiv:2310.08237v1 [stat.ML])
    Covariate shift occurs prevalently in practice, where the input distributions of the source and target data are substantially different. Despite its practical importance in various learning problems, most of the existing methods only focus on some specific learning tasks and are not well validated theoretically and numerically. To tackle this problem, we propose a unified analysis of general nonparametric methods in a reproducing kernel Hilbert space (RKHS) under covariate shift. Our theoretical results are established for a general loss belonging to a rich loss function family, which includes many commonly used methods as special cases, such as mean regression, quantile regression, likelihood-based classification, and margin-based classification. Two types of covariate shift problems are the focus of this paper and the sharp convergence rates are established for a general loss function to provide a unified theoretical analysis, which concurs with the optimal results in literature where the squared loss is used. Extensive numerical studies on synthetic and real examples confirm our theoretical findings and further illustrate the effectiveness of our proposed method.  ( 2 min )
    Multi-View Variational Autoencoder for Missing Value Imputation in Untargeted Metabolomics. (arXiv:2310.07990v1 [q-bio.GN])
    Background: Missing data is a common challenge in mass spectrometry-based metabolomics, which can lead to biased and incomplete analyses. The integration of whole-genome sequencing (WGS) data with metabolomics data has emerged as a promising approach to enhance the accuracy of data imputation in metabolomics studies. Method: In this study, we propose a novel method that leverages the information from WGS data and reference metabolites to impute unknown metabolites. Our approach utilizes a multi-view variational autoencoder to jointly model the burden score, polygenetic risk score (PGS), and linkage disequilibrium (LD) pruned single nucleotide polymorphisms (SNPs) for feature extraction and missing metabolomics data imputation. By learning the latent representations of both omics data, our method can effectively impute missing metabolomics values based on genomic information. Results: We evaluate the performance of our method on empirical metabolomics datasets with missing values and demonstrate its superiority compared to conventional imputation techniques. Using 35 template metabolites derived burden scores, PGS and LD-pruned SNPs, the proposed methods achieved r2-scores > 0.01 for 71.55% of metabolites. Conclusion: The integration of WGS data in metabolomics imputation not only improves data completeness but also enhances downstream analyses, paving the way for more comprehensive and accurate investigations of metabolic pathways and disease associations. Our findings offer valuable insights into the potential benefits of utilizing WGS data for metabolomics data imputation and underscore the importance of leveraging multi-modal data integration in precision medicine research.  ( 3 min )
    Generative Modeling with Phase Stochastic Bridges. (arXiv:2310.07805v1 [cs.LG])
    Diffusion models (DMs) represent state-of-the-art generative models for continuous inputs. DMs work by constructing a Stochastic Differential Equation (SDE) in the input space (i.e., position space), and using a neural network to reverse it. In this work, we introduce a novel generative modeling framework grounded in phase space dynamics, where a phase space is defined as an augmented space encompassing both position and velocity. Leveraging insights from Stochastic Optimal Control, we construct a path measure in the phase space that enables efficient sampling. In contrast to DMs, our framework demonstrates the capability to generate realistic data points at an early stage of dynamics propagation. This early prediction sets the stage for efficient data generation by leveraging additional velocity information along the trajectory. On standard image generation benchmarks, our model yields favorable performance over baselines in the regime of small Number of Function Evaluations (NFEs). Furthermore, our approach rivals the performance of diffusion models equipped with efficient sampling techniques, underscoring its potential as a new tool for generative modeling.  ( 2 min )
    Optimizing Convolutional Neural Networks for Chronic Obstructive Pulmonary Disease Detection in Clinical Computed Tomography Imaging. (arXiv:2303.07189v3 [eess.IV] UPDATED)
    We aim to optimize the binary detection of Chronic Obstructive Pulmonary Disease (COPD) based on emphysema presence in the lung with convolutional neural networks (CNN) by exploring manually adjusted versus automated window-setting optimization (WSO) on computed tomography (CT) images. 7,194 CT images (3,597 with COPD; 3,597 healthy controls) from 78 subjects (43 with COPD; 35 healthy controls) were selected retrospectively (10.2018-12.2019) and preprocessed. For each image, intensity values were manually clipped to the emphysema window setting and a baseline 'full-range' window setting. Class-balanced train, validation, and test sets contained 3,392, 1,114, and 2,688 images. The network backbone was optimized by comparing various CNN architectures. Furthermore, automated WSO was implemented by adding a customized layer to the model. The image-level area under the Receiver Operating Characteristics curve (AUC) [lower, upper limit 95% confidence] was utilized to compare model variations. Repeated inference (n=7) on the test set showed that the DenseNet was the most efficient backbone and achieved a mean AUC of 0.80 [0.76, 0.85] without WSO. Comparably, with input images manually adjusted to the emphysema window, the DenseNet model predicted COPD with a mean AUC of 0.86 [0.82, 0.89]. By adding a customized WSO layer to the DenseNet, an optimal window in the proximity of the emphysema window setting was learned automatically, and a mean AUC of 0.82 [0.78, 0.86] was achieved. Detection of COPD with DenseNet models was improved by WSO of CT data to the emphysema window setting range.  ( 3 min )
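    The manual window setting described above amounts to clipping CT intensities to a fixed range before they reach the network. A minimal sketch follows, with loud caveats: the abstract does not give the emphysema window values, so the center and width below are hypothetical placeholders, and rescaling the clipped values to [0, 1] is an added assumption rather than a documented step of the study.

```python
def apply_window(values, center, width):
    """Clip intensities to [center - width/2, center + width/2], rescale to [0, 1].

    The rescaling step is an assumption for illustration; the abstract only
    states that intensity values were clipped to a window setting.
    """
    lo = center - width / 2
    hi = center + width / 2
    return [(min(max(v, lo), hi) - lo) / (hi - lo) for v in values]


# Hypothetical window (Hounsfield units); not taken from the paper.
windowed = apply_window([-1200, -1000, -800, -600, 0], center=-800, width=400)
```

    An automated variant in the spirit of the WSO layer would treat `center` and `width` as learnable parameters instead of fixed constants.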
    High-fidelity Pseudo-labels for Boosting Weakly-Supervised Segmentation. (arXiv:2304.02621v2 [cs.CV] UPDATED)
    Image-level weakly-supervised semantic segmentation (WSSS) reduces the usually vast data annotation cost by surrogate segmentation masks during training. The typical approach involves training an image classification network using global average pooling (GAP) on convolutional feature maps. This enables the estimation of object locations based on class activation maps (CAMs), which identify the importance of image regions. The CAMs are then used to generate pseudo-labels, in the form of segmentation masks, to supervise a segmentation model in the absence of pixel-level ground truth. Our work is based on two techniques for improving CAMs; importance sampling, which is a substitute for GAP, and the feature similarity loss, which utilizes a heuristic that object contours almost always align with color edges in images. However, both are based on the multinomial posterior with softmax, and implicitly assume that classes are mutually exclusive, which turns out suboptimal in our experiments. Thus, we reformulate both techniques based on binomial posteriors of multiple independent binary problems. This has two benefits; their performance is improved and they become more general, resulting in an add-on method that can boost virtually any WSSS method. This is demonstrated on a wide variety of baselines on the PASCAL VOC dataset, improving the region similarity and contour quality of all implemented state-of-the-art methods. Experiments on the MS COCO dataset show that our proposed add-on is well-suited for large-scale settings. Our code is available at https://github.com/arvijj/hfpl.  ( 3 min )
    Differentially-Private Decision Trees and Provable Robustness to Data Poisoning. (arXiv:2305.15394v2 [cs.LG] UPDATED)
    Decision trees are interpretable models that are well-suited to non-linear learning problems. Much work has been done on extending decision tree learning algorithms with differential privacy, a system that guarantees the privacy of samples within the training data. However, current state-of-the-art algorithms for this purpose sacrifice much utility for a small privacy benefit. These solutions create random decision nodes that reduce decision tree accuracy or spend an excessive share of the privacy budget on labeling leaves. Moreover, many works do not support continuous features or leak information about them. We propose a new method called PrivaTree based on private histograms that chooses good splits while consuming a small privacy budget. The resulting trees provide a significantly better privacy-utility trade-off and accept mixed numerical and categorical data without leaking information about numerical features. Finally, while it is notoriously hard to give robustness guarantees against data poisoning attacks, we demonstrate bounds for the expected accuracy and success rates of backdoor attacks against differentially-private learners. By leveraging the better privacy-utility trade-off of PrivaTree we are able to train decision trees with significantly better robustness against backdoor attacks compared to regular decision trees and with meaningful theoretical guarantees.  ( 2 min )
    A Symmetry-Aware Exploration of Bayesian Neural Network Posteriors. (arXiv:2310.08287v1 [stat.ML])
    The distribution of the weights of modern deep neural networks (DNNs) - crucial for uncertainty quantification and robustness - is an eminently complex object due to its extremely high dimensionality. This paper proposes one of the first large-scale explorations of the posterior distribution of deep Bayesian Neural Networks (BNNs), expanding its study to real-world vision tasks and architectures. Specifically, we investigate the optimal approach for approximating the posterior, analyze the connection between posterior quality and uncertainty quantification, delve into the impact of modes on the posterior, and explore methods for visualizing the posterior. Moreover, we uncover weight-space symmetries as a critical aspect for understanding the posterior. To this extent, we develop an in-depth assessment of the impact of both permutation and scaling symmetries that tend to obfuscate the Bayesian posterior. While the first type of transformation is known for duplicating modes, we explore the relationship between the latter and L2 regularization, challenging previous misconceptions. Finally, to help the community improve our understanding of the Bayesian posterior, we will shortly release the first large-scale checkpoint dataset, including thousands of real-world models and our codes.
    Learning Transferable Conceptual Prototypes for Interpretable Unsupervised Domain Adaptation. (arXiv:2310.08071v1 [cs.LG])
    Despite the great progress of unsupervised domain adaptation (UDA) with the deep neural networks, current UDA models are opaque and cannot provide promising explanations, limiting their applications in the scenarios that require safe and controllable model decisions. At present, a surge of work focuses on designing deep interpretable methods with adequate data annotations and only a few methods consider the distributional shift problem. Most existing interpretable UDA methods are post-hoc ones, which cannot facilitate the model learning process for performance enhancement. In this paper, we propose an inherently interpretable method, named Transferable Conceptual Prototype Learning (TCPL), which could simultaneously interpret and improve the processes of knowledge transfer and decision-making in UDA. To achieve this goal, we design a hierarchically prototypical module that transfers categorical basic concepts from the source domain to the target domain and learns domain-shared prototypes for explaining the underlying reasoning process. With the learned transferable prototypes, a self-predictive consistent pseudo-label strategy that fuses confidence, predictions, and prototype information, is designed for selecting suitable target samples for pseudo annotations and gradually narrowing down the domain gap. Comprehensive experiments show that the proposed method can not only provide effective and intuitive explanations but also outperform previous state-of-the-arts.  ( 2 min )
    Graph-SCP: Accelerating Set Cover Problems with Graph Neural Networks. (arXiv:2310.07979v1 [cs.LG])
    Machine learning (ML) approaches are increasingly being used to accelerate combinatorial optimization (CO) problems. We look specifically at the Set Cover Problem (SCP) and propose Graph-SCP, a graph neural network method that can augment existing optimization solvers by learning to identify a much smaller sub-problem that contains the solution space. We evaluate the performance of Graph-SCP on synthetic weighted and unweighted SCP instances with diverse problem characteristics and complexities, and on instances from the OR Library, a canonical benchmark for SCP. We show that Graph-SCP reduces the problem size by 30-70% and achieves run time speedups up to 25x when compared to commercial solvers (Gurobi). Given a desired optimality threshold, Graph-SCP will improve upon it or even achieve 100% optimality. This is in contrast to fast greedy solutions that significantly compromise solution quality to achieve guaranteed polynomial run time. Graph-SCP can generalize to larger problem sizes and can be used with other conventional or ML-augmented CO solvers to lead to potential additional run time improvement.  ( 2 min )
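For context, the "fast greedy solutions" that the abstract contrasts with Graph-SCP are typically the classic greedy heuristic for set cover. The sketch below shows that baseline only, not Graph-SCP itself; the function name and the example instance are made up for illustration.

```python
def greedy_set_cover(universe, subsets, weights=None):
    """Return indices of subsets chosen by the classic greedy heuristic.

    At each step, pick the subset with the lowest weight per newly covered element.
    """
    uncovered = set(universe)
    if weights is None:
        weights = [1.0] * len(subsets)  # unweighted instance
    chosen = []
    while uncovered:
        best_index, best_ratio = None, None
        for i, subset in enumerate(subsets):
            gain = len(uncovered & subset)
            if gain == 0:
                continue
            ratio = weights[i] / gain
            if best_ratio is None or ratio < best_ratio:
                best_index, best_ratio = i, ratio
        if best_index is None:
            raise ValueError("subsets do not cover the universe")
        chosen.append(best_index)
        uncovered -= subsets[best_index]
    return chosen


example_universe = {1, 2, 3, 4, 5}
example_subsets = [{1, 2, 3}, {2, 4}, {3, 4}, {4, 5}]
cover = greedy_set_cover(example_universe, example_subsets)
```

The heuristic runs in polynomial time but carries no guarantee of optimality, which is the trade-off the abstract refers to.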
    Hyperparameter Adaptive Search for Surrogate Optimization: A Self-Adjusting Approach. (arXiv:2310.07970v1 [cs.LG])
    Surrogate Optimization (SO) algorithms have shown promise for optimizing expensive black-box functions. However, their performance is heavily influenced by hyperparameters related to sampling and surrogate fitting, which poses a challenge to their widespread adoption. We investigate the impact of hyperparameters on various SO algorithms and propose a Hyperparameter Adaptive Search for SO (HASSO) approach. HASSO is not a hyperparameter tuning algorithm, but a generic self-adjusting SO algorithm that dynamically tunes its own hyperparameters while concurrently optimizing the primary objective function, without requiring additional evaluations. The aim is to improve the accessibility, effectiveness, and convergence speed of SO algorithms for practitioners. Our approach identifies and modifies the most influential hyperparameters specific to each problem and SO approach, reducing the need for manual tuning without significantly increasing the computational burden. Experimental results demonstrate the effectiveness of HASSO in enhancing the performance of various SO algorithms across different global optimization test problems.  ( 2 min )
    Discerning Temporal Difference Learning. (arXiv:2310.08091v1 [cs.LG])
    Temporal difference learning (TD) is a foundational concept in reinforcement learning (RL), aimed at efficiently assessing a policy's value function. TD($\lambda$), a potent variant, incorporates a memory trace to distribute the prediction error into the historical context. However, this approach often neglects the significance of historical states and the relative importance of propagating the TD error, influenced by challenges such as visitation imbalance or outcome noise. To address this, we propose a novel TD algorithm named discerning TD learning (DTD), which allows flexible emphasis functions (predetermined or adapted during training) to allocate efforts effectively across states. We establish the convergence properties of our method within a specific class of emphasis functions and showcase its promising potential for adaptation to deep RL contexts. Empirical results underscore that employing a judicious emphasis function not only improves value estimation but also expedites learning across diverse scenarios.  ( 2 min )
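As background for the abstract above, here is a minimal tabular sketch of standard TD($\lambda$) with accumulating eligibility traces, i.e. the "memory trace" that spreads each TD error over previously visited states. It is the textbook baseline, not the DTD algorithm; the episode format (lists of `(state, reward, next_state)` tuples with `None` marking termination) is an assumption made for this example.

```python
def td_lambda(episodes, num_states, alpha=0.1, gamma=0.9, lam=0.8):
    """Tabular TD(lambda) value estimation with accumulating traces."""
    values = [0.0] * num_states
    for episode in episodes:
        traces = [0.0] * num_states  # eligibility traces reset at episode start
        for state, reward, next_state in episode:
            next_value = 0.0 if next_state is None else values[next_state]
            td_error = reward + gamma * next_value - values[state]
            traces[state] += 1.0
            for s in range(num_states):
                # Every state is updated in proportion to its current trace.
                values[s] += alpha * td_error * traces[s]
                traces[s] *= gamma * lam
    return values


# Two-state chain: 0 -> 1 -> terminal, with reward 1 on the final transition.
chain_episode = [(0, 0.0, 1), (1, 1.0, None)]
values_full_trace = td_lambda([chain_episode], 2, alpha=0.5, gamma=1.0, lam=1.0)
values_no_trace = td_lambda([chain_episode], 2, alpha=0.5, gamma=1.0, lam=0.0)
```

With `lam=1.0` the final reward also updates state 0 through its trace; with `lam=0.0` only the state that was just visited is updated.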
    Unraveling the Single Tangent Space Fallacy: An Analysis and Clarification for Applying Riemannian Geometry in Robot Learning. (arXiv:2310.07902v1 [cs.RO])
    In the realm of robotics, numerous downstream robotics tasks leverage machine learning methods for processing, modeling, or synthesizing data. Often, this data comprises variables that inherently carry geometric constraints, such as the unit-norm condition of quaternions representing rigid-body orientations or the positive definiteness of stiffness and manipulability ellipsoids. Handling such geometric constraints effectively requires the incorporation of tools from differential geometry into the formulation of machine learning methods. In this context, Riemannian manifolds emerge as a powerful mathematical framework to handle such geometric constraints. Nevertheless, their recent adoption in robot learning has been largely characterized by a mathematically-flawed simplification, hereinafter referred to as the "single tangent space fallacy". This approach involves merely projecting the data of interest onto a single tangent (Euclidean) space, over which an off-the-shelf learning algorithm is applied. This paper provides a theoretical elucidation of various misconceptions surrounding this approach and offers experimental evidence of its shortcomings. Finally, it presents valuable insights to promote best practices when employing Riemannian geometry within robot learning applications.  ( 2 min )
    LEMON: Lossless model expansion. (arXiv:2310.07999v1 [cs.LG])
    Scaling of deep neural networks, especially Transformers, is pivotal for their surging performance and has further led to the emergence of sophisticated reasoning capabilities in foundation models. Such scaling generally requires training large models from scratch with random initialization, failing to leverage the knowledge acquired by their smaller counterparts, which are already resource-intensive to obtain. To tackle this inefficiency, we present LosslEss MOdel ExpansioN (LEMON), a recipe to initialize scaled models using the weights of their smaller but pre-trained counterparts. This is followed by model training with an optimized learning rate scheduler tailored explicitly for the scaled models, substantially reducing the training time compared to training from scratch. Notably, LEMON is versatile, ensuring compatibility with various network structures, including models like Vision Transformers and BERT. Our empirical results demonstrate that LEMON reduces computational costs by 56.7% for Vision Transformers and 33.2% for BERT when compared to training from scratch.  ( 2 min )
    TabLib: A Dataset of 627M Tables with Context. (arXiv:2310.07875v1 [cs.CL])
    It is well-established that large, diverse datasets play a pivotal role in the performance of modern AI systems for text and image modalities. However, there are no datasets for tabular data of comparable size and diversity to those available for text and images. Thus we present "TabLib", a compilation of 627 million tables totaling 69 TiB, along with 867B tokens of context. TabLib was extracted from numerous file formats, including CSV, HTML, SQLite, PDF, Excel, and others, sourced from GitHub and Common Crawl. The size and diversity of TabLib offer considerable promise in the table modality, reminiscent of the original promise of foundational datasets for text and images, such as The Pile and LAION.  ( 2 min )
    Enhanced sampling of Crystal Nucleation with Graph Representation Learnt Variables. (arXiv:2310.07927v1 [cond-mat.stat-mech])
    In this study, we present a graph neural network-based learning approach using an autoencoder setup to derive low-dimensional variables from features observed in experimental crystal structures. These variables are then biased in enhanced sampling to observe state-to-state transitions and reliable thermodynamic weights. Our approach uses simple convolution and pooling methods. To verify the effectiveness of our protocol, we examined the nucleation of various allotropes and polymorphs of iron and glycine from their molten states. Our graph latent variables when biased in well-tempered metadynamics consistently show transitions between states and achieve accurate free energy calculations in agreement with experiments, both of which are indicators of dependable sampling. This underscores the strength and promise of our graph neural net variables for improved sampling. The protocol shown here should be applicable for other systems and with other sampling methods.  ( 2 min )
    Online RL in Linearly $q^\pi$-Realizable MDPs Is as Easy as in Linear MDPs If You Learn What to Ignore. (arXiv:2310.07811v1 [cs.LG])
    We consider online reinforcement learning (RL) in episodic Markov decision processes (MDPs) under the linear $q^\pi$-realizability assumption, where it is assumed that the action-values of all policies can be expressed as linear functions of state-action features. This class is known to be more general than linear MDPs, where the transition kernel and the reward function are assumed to be linear functions of the feature vectors. As our first contribution, we show that the difference between the two classes is the presence of states in linearly $q^\pi$-realizable MDPs where for any policy, all the actions have approximately equal values, and skipping over these states by following an arbitrarily fixed policy in those states transforms the problem to a linear MDP. Based on this observation, we derive a novel (computationally inefficient) learning algorithm for linearly $q^\pi$-realizable MDPs that simultaneously learns what states should be skipped over and runs another learning algorithm on the linear MDP hidden in the problem. The method returns an $\epsilon$-optimal policy after $\text{polylog}(H, d)/\epsilon^2$ interactions with the MDP, where $H$ is the time horizon and $d$ is the dimension of the feature vectors, giving the first polynomial-sample-complexity online RL algorithm for this setting. The results are proved for the misspecified case, where the sample complexity is shown to degrade gracefully with the misspecification error.  ( 3 min )
    Large Language Models Are Zero-Shot Time Series Forecasters. (arXiv:2310.07820v1 [cs.LG])
    By encoding time series as a string of numerical digits, we can frame time series forecasting as next-token prediction in text. Developing this approach, we find that large language models (LLMs) such as GPT-3 and LLaMA-2 can surprisingly zero-shot extrapolate time series at a level comparable to or exceeding the performance of purpose-built time series models trained on the downstream tasks. To facilitate this performance, we propose procedures for effectively tokenizing time series data and converting discrete distributions over tokens into highly flexible densities over continuous values. We argue the success of LLMs for time series stems from their ability to naturally represent multimodal distributions, in conjunction with biases for simplicity, and repetition, which align with the salient features in many time series, such as repeated seasonal trends. We also show how LLMs can naturally handle missing data without imputation through non-numerical text, accommodate textual side information, and answer questions to help explain predictions. While we find that increasing model size generally improves performance on time series, we show GPT-4 can perform worse than GPT-3 because of how it tokenizes numbers, and poor uncertainty calibration, which is likely the result of alignment interventions such as RLHF.  ( 2 min )
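The idea of framing a time series as a string of digits can be illustrated with a small round-trip encoder. The specific scheme below (fixed decimal precision, a space between digits, a comma between time steps, non-negative values only) is an assumption chosen for the sketch and is not claimed to be the paper's exact tokenization procedure.

```python
def encode_series(values, decimals=2):
    """Encode non-negative numbers as space-separated digits, comma-separated per step."""
    encoded_steps = []
    for value in values:
        scaled = round(value * 10 ** decimals)  # drop the decimal point
        if scaled < 0:
            raise ValueError("this sketch handles non-negative values only")
        encoded_steps.append(" ".join(str(scaled)))
    return " , ".join(encoded_steps)


def decode_series(text, decimals=2):
    """Invert encode_series: strip spaces, parse digits, restore the decimal point."""
    values = []
    for chunk in text.split(","):
        digits = chunk.replace(" ", "")
        values.append(int(digits) / 10 ** decimals)
    return values


series_text = encode_series([0.64, 1.5, 12.0])
round_trip = decode_series(series_text)
```

Separating digits keeps each one as its own token for tokenizers that would otherwise merge digit runs unpredictably, which relates to the abstract's remark that number tokenization affects forecasting quality.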
    ASV Station Keeping under Wind Disturbances using Neural Network Simulation Error Minimization Model Predictive Control. (arXiv:2310.07892v1 [cs.RO])
    Station keeping is an essential maneuver for Autonomous Surface Vehicles (ASVs), mainly when used in confined spaces, to carry out surveys that require the ASV to keep its position or in collaboration with other vehicles where the relative position has an impact over the mission. However, this maneuver can become challenging for classic feedback controllers due to the need for an accurate model of the ASV dynamics and the environmental disturbances. This work proposes a Model Predictive Controller using Neural Network Simulation Error Minimization (NNSEM-MPC) to accurately predict the dynamics of the ASV under wind disturbances. The performance of the proposed scheme under wind disturbances is tested and compared against other controllers in simulation, using the Robot Operating System (ROS) and the multipurpose simulation environment Gazebo. A set of six tests was conducted by combining two wind speeds (3 m/s and 6 m/s) and three wind directions (0$^\circ$, 90$^\circ$, and 180$^\circ$). The simulation results clearly show the advantage of the NNSEM-MPC over the following methods: backstepping controller, sliding mode controller, simplified dynamics MPC (SD-MPC), neural ordinary differential equation MPC (NODE-MPC), and knowledge-based NODE MPC (KNODE-MPC). The proposed NNSEM-MPC approach performs better than the rest in 4 out of the 6 test conditions, and it is the second best in the 2 remaining test cases, reducing the mean position and heading error by at least 31% and 46% respectively across all the test cases. In terms of execution speed, the proposed NNSEM-MPC is at least 36% faster than the rest of the MPC controllers. The field experiments on two different ASV platforms showed that ASVs can effectively keep the station utilizing the proposed method, with a position error as low as $1.68$ m and a heading error as low as $6.14^{\circ}$ within time windows of at least $150$ s.  ( 3 min )
    NoMaD: Goal Masked Diffusion Policies for Navigation and Exploration. (arXiv:2310.07896v1 [cs.RO])
    Robotic learning for navigation in unfamiliar environments needs to provide policies for both task-oriented navigation (i.e., reaching a goal that the robot has located), and task-agnostic exploration (i.e., searching for a goal in a novel setting). Typically, these roles are handled by separate models, for example by using subgoal proposals, planning, or separate navigation strategies. In this paper, we describe how we can train a single unified diffusion policy to handle both goal-directed navigation and goal-agnostic exploration, with the latter providing the ability to search novel environments, and the former providing the ability to reach a user-specified goal once it has been located. We show that this unified policy results in better overall performance when navigating to visually indicated goals in novel environments, as compared to approaches that use subgoal proposals from generative models, or prior methods based on latent variable models. We instantiate our method by using a large-scale Transformer-based policy trained on data from multiple ground robots, with a diffusion model decoder to flexibly handle both goal-conditioned and goal-agnostic navigation. Our experiments, conducted on a real-world mobile robot platform, show effective navigation in unseen environments in comparison with five alternative methods, and demonstrate significant improvements in performance and lower collision rates, despite utilizing smaller models than state-of-the-art approaches. For more videos, code, and pre-trained model checkpoints, see https://general-navigation-models.github.io/nomad/  ( 2 min )
    RandCom: Random Communication Skipping Method for Decentralized Stochastic Optimization. (arXiv:2310.07983v1 [cs.LG])
    Distributed optimization methods with random communication skips are gaining increasing attention due to their proven benefits in accelerating communication complexity. Nevertheless, existing research mainly focuses on centralized communication protocols for strongly convex deterministic settings. In this work, we provide a decentralized optimization method called RandCom, which incorporates probabilistic local updates. We analyze the performance of RandCom in stochastic non-convex, convex, and strongly convex settings and demonstrate its ability to asymptotically reduce communication overhead by the probability of communication. Additionally, we prove that RandCom achieves linear speedup as the number of nodes increases. In stochastic strongly convex settings, we further prove that RandCom can achieve linear speedup with network-independent stepsizes. Moreover, we apply RandCom to federated learning and provide positive results concerning the potential for achieving linear speedup and the suitability of the probabilistic local update approach for non-convex settings.  ( 2 min )
    A Review of Machine Learning Techniques in Imbalanced Data and Future Trends. (arXiv:2310.07917v1 [cs.LG])
    For over two decades, detecting rare events has been a challenging task among researchers in the data mining and machine learning domain. Real-life problems inspire researchers to navigate and further improve data processing and algorithmic approaches to achieve effective and computationally efficient methods for imbalanced learning. In this paper, we have collected and reviewed 258 peer-reviewed papers from archival journals and conference papers in an attempt to provide an in-depth review of various approaches in imbalanced learning from technical and application perspectives. This work aims to provide a structured review of methods used to address the problem of imbalanced data in various domains and create a general guideline for researchers in academia or industry who want to dive into the broad field of machine learning using large-scale imbalanced data.  ( 2 min )
    QArchSearch: A Scalable Quantum Architecture Search Package. (arXiv:2310.07858v1 [quant-ph])
    The current era of quantum computing has yielded several algorithms that promise high computational efficiency. While the algorithms are sound in theory and can provide potentially exponential speedup, there is little guidance on how to design proper quantum circuits to realize the appropriate unitary transformation to be applied to the input quantum state. In this paper, we present QArchSearch, an AI-based quantum architecture search package with the QTensor library as a backend that provides a principled and automated approach to finding the best model given a task and input quantum state. We show that the search package is able to efficiently scale the search to large quantum circuits and enables the exploration of more complex models for different quantum applications. QArchSearch runs at scale and high efficiency on high-performance computing systems using a two-level parallelization scheme on both CPUs and GPUs, which has been demonstrated on the Polaris supercomputer.  ( 2 min )
    When, Why and How Much? Adaptive Learning Rate Scheduling by Refinement. (arXiv:2310.07831v1 [cs.LG])
    Learning rate schedules used in practice bear little resemblance to those recommended by theory. We close much of this theory/practice gap, and as a consequence are able to derive new problem-adaptive learning rate schedules. Our key technical contribution is a refined analysis of learning rate schedules for a wide class of optimization algorithms (including SGD). In contrast to most prior works that study the convergence of the average iterate, we study the last iterate, which is what most people use in practice. When considering only worst-case analysis, our theory predicts that the best choice is the linear decay schedule: a popular choice in practice that sets the stepsize proportionally to $1 - t/T$, where $t$ is the current iteration and $T$ is the total number of steps. To go beyond this worst-case analysis, we use the observed gradient norms to derive schedules refined for any particular task. These refined schedules exhibit learning rate warm-up and rapid learning rate annealing near the end of training. Ours is the first systematic approach to automatically yield both of these properties. We perform the most comprehensive evaluation of learning rate schedules to date, evaluating across 10 diverse deep learning problems, a series of LLMs, and a suite of logistic regression problems. We validate that overall, the linear-decay schedule matches or outperforms all commonly used default schedules including cosine annealing, and that our schedule refinement method gives further improvements.  ( 3 min )
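The linear decay schedule named in the abstract sets the stepsize proportionally to $1 - t/T$. A minimal sketch follows; the function name and the `base_lr` parameter are assumptions for illustration, and the paper's gradient-norm-based refinement is not implemented here.

```python
def linear_decay(base_lr, step, total_steps):
    """Learning rate proportional to 1 - t/T, where t is the step and T the total steps."""
    if not 0 <= step <= total_steps:
        raise ValueError("step must lie in [0, total_steps]")
    return base_lr * (1.0 - step / total_steps)


# Learning rates at the start, midpoint, and end of a 100-step run.
schedule_samples = [linear_decay(1.0, t, 100) for t in (0, 50, 100)]
```

The rate falls in a straight line from `base_lr` at the first step to zero at the last.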
    Feature Learning and Generalization in Deep Networks with Orthogonal Weights. (arXiv:2310.07765v1 [cs.LG])
    Fully-connected deep neural networks with weights initialized from independent Gaussian distributions can be tuned to criticality, which prevents the exponential growth or decay of signals propagating through the network. However, such networks still exhibit fluctuations that grow linearly with the depth of the network, which may impair the training of networks with width comparable to depth. We show analytically that rectangular networks with tanh activations and weights initialized from the ensemble of orthogonal matrices have corresponding preactivation fluctuations which are independent of depth, to leading order in inverse width. Moreover, we demonstrate numerically that, at initialization, all correlators involving the neural tangent kernel (NTK) and its descendants at leading order in inverse width -- which govern the evolution of observables during training -- saturate at a depth of $\sim 20$, rather than growing without bound as in the case of Gaussian initializations. We speculate that this structure preserves finite-width feature learning while reducing overall noise, thus improving both generalization and training speed. We provide some experimental justification by relating empirical measurements of the NTK to the superior performance of deep nonlinear orthogonal networks trained under full-batch gradient descent on the MNIST and CIFAR-10 classification tasks.  ( 2 min )
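To make "weights initialized from the ensemble of orthogonal matrices" concrete, one common construction orthonormalizes independent Gaussian vectors. The pure-Python sketch below uses Gram-Schmidt for a square matrix; the function name and seed handling are assumptions, and deep learning frameworks ship their own orthogonal initializers that would normally be used instead.

```python
import math
import random


def random_orthogonal(n, seed=0):
    """Return an n x n matrix (list of rows) with orthonormal rows."""
    rng = random.Random(seed)
    rows = []
    while len(rows) < n:
        vector = [rng.gauss(0.0, 1.0) for _ in range(n)]
        # Remove the components along the rows accepted so far (modified Gram-Schmidt).
        for q in rows:
            dot = sum(a * b for a, b in zip(vector, q))
            vector = [a - dot * b for a, b in zip(vector, q)]
        norm = math.sqrt(sum(a * a for a in vector))
        if norm > 1e-12:  # resample in the unlikely degenerate case
            rows.append([a / norm for a in vector])
    return rows


weights = random_orthogonal(4)
# Gram matrix W W^T, which should be close to the identity.
gram = [[sum(a * b for a, b in zip(r1, r2)) for r2 in weights] for r1 in weights]
```

Because such a matrix preserves vector norms exactly, the linear part of each layer neither amplifies nor shrinks signals, which is the property the abstract contrasts with Gaussian initialization.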
    Faithfulness Measurable Masked Language Models. (arXiv:2310.07819v1 [cs.CL])
    A common approach to explaining NLP models is to use importance measures that express which tokens are important for a prediction. Unfortunately, such explanations are often wrong despite being persuasive. Therefore, it is essential to measure their faithfulness. One such metric rests on the idea that if tokens are truly important, then masking them should result in worse model performance. However, token masking introduces out-of-distribution issues and existing solutions are computationally expensive and employ proxy-models. Furthermore, other metrics are very limited in scope. In this work, we propose an inherently faithfulness measurable model that addresses these challenges. This is achieved by using a novel fine-tuning method that incorporates masking, such that masking tokens become in-distribution by design. This differs from existing approaches, which are completely model-agnostic but are inapplicable in practice. We demonstrate the generality of our approach by applying it to various tasks and validate it using statistical in-distribution tests. Additionally, because masking is in-distribution, importance measures which themselves use masking become more faithful, thus our model becomes more explainable.  ( 2 min )
    Using Spark Machine Learning Models to Perform Predictive Analysis on Flight Ticket Pricing Data. (arXiv:2310.07787v1 [cs.LG])
    This paper discusses the predictive performance, measured by R² (R-squared) and RMSE, and the processes undertaken on flight pricing data from a large dataset, originally from Expedia.com, consisting of approximately 20 million records or 4.68 gigabytes. The project aims to determine the best models usable in the real world to predict airline ticket fares for non-stop flights across the US. Therefore, good generalization capability and optimized processing times are important measures for the model. We will discover key business insights utilizing feature importance and discuss the process and tools used for our analysis. Four regression machine learning algorithms were utilized: Random Forest, Gradient Boost Tree, Decision Tree, and Factorization Machines utilizing Cross Validator and Training Validator functions for assessing performance and generalization capability.  ( 2 min )
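The two evaluation metrics named in the abstract, R-squared and RMSE, can be computed as in the plain-Python sketch below. The fare values are invented for illustration; the paper itself works with Spark's machine learning tooling rather than hand-written functions like these.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(y_true))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical ticket fares and model predictions, in dollars.
actual_fares = [100.0, 200.0, 300.0]
predicted_fares = [110.0, 190.0, 300.0]
fare_rmse = rmse(actual_fares, predicted_fares)
fare_r2 = r_squared(actual_fares, predicted_fares)
```

RMSE is expressed in the same units as the fares, while R-squared is unitless and reports the share of variance in the fares that the predictions explain.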
    Self-supervised Representation Learning From Random Data Projectors. (arXiv:2310.07756v1 [cs.LG])
    Self-supervised representation learning (SSRL) has advanced considerably by exploiting the transformation invariance assumption under artificially designed data augmentations. While augmentation-based SSRL algorithms push the boundaries of performance in computer vision and natural language processing, they are often not directly applicable to other data modalities, and can conflict with application-specific data augmentation constraints. This paper presents an SSRL approach that can be applied to any data modality and network architecture because it does not rely on augmentations or masking. Specifically, we show that high-quality data representations can be learned by reconstructing random data projections. We evaluate the proposed approach on a wide range of representation learning tasks that span diverse modalities and real-world applications. We show that it outperforms multiple state-of-the-art SSRL baselines. Due to its wide applicability and strong empirical results, we argue that learning from randomness is a fruitful research direction worthy of attention and further study.  ( 2 min )
    Parametric Leaky Tanh: A New Hybrid Activation Function for Deep Learning. (arXiv:2310.07720v1 [cs.LG])
    Activation functions (AFs) are crucial components of deep neural networks (DNNs), having a significant impact on their performance. An activation function in a DNN is typically a smooth, nonlinear function that transforms an input signal into an output signal for the subsequent layer. In this paper, we propose the Parametric Leaky Tanh (PLTanh), a novel hybrid activation function designed to combine the strengths of both the Tanh and Leaky ReLU (LReLU) activation functions. PLTanh is differentiable at all points and addresses the 'dying ReLU' problem by ensuring a non-zero gradient for negative inputs, consistent with the behavior of LReLU. By integrating the unique advantages of these two diverse activation functions, PLTanh facilitates the learning of more intricate nonlinear relationships within the network. This paper presents an empirical evaluation of PLTanh against established activation functions, namely ReLU, LReLU, and ALReLU utilizing five diverse datasets.  ( 2 min )
    Visual Forecasting as a Mid-level Representation for Avoidance. (arXiv:2310.07724v1 [cs.RO])
    The challenge of navigation in environments with dynamic objects continues to be a central issue in the study of autonomous agents. While predictive methods hold promise, their reliance on precise state information makes them less practical for real-world implementation. This study presents visual forecasting as an innovative alternative. By introducing intuitive visual cues, this approach projects the future trajectories of dynamic objects to improve agent perception and enable anticipatory actions. Our research explores two distinct strategies for conveying predictive information through visual forecasting: (1) sequences of bounding boxes, and (2) augmented paths. To validate the proposed visual forecasting strategies, we initiate evaluations in simulated environments using the Unity engine and then extend these evaluations to real-world scenarios to assess both practicality and effectiveness. The results confirm the viability of visual forecasting as a promising solution for navigation and obstacle avoidance in dynamic environments.  ( 2 min )
  • Open

    Feature Learning and Generalization in Deep Networks with Orthogonal Weights. (arXiv:2310.07765v1 [cs.LG])
    Fully-connected deep neural networks with weights initialized from independent Gaussian distributions can be tuned to criticality, which prevents the exponential growth or decay of signals propagating through the network. However, such networks still exhibit fluctuations that grow linearly with the depth of the network, which may impair the training of networks with width comparable to depth. We show analytically that rectangular networks with tanh activations and weights initialized from the ensemble of orthogonal matrices have corresponding preactivation fluctuations which are independent of depth, to leading order in inverse width. Moreover, we demonstrate numerically that, at initialization, all correlators involving the neural tangent kernel (NTK) and its descendants at leading order in inverse width -- which govern the evolution of observables during training -- saturate at a depth of $\sim 20$, rather than growing without bound as in the case of Gaussian initializations. We speculate that this structure preserves finite-width feature learning while reducing overall noise, thus improving both generalization and training speed. We provide some experimental justification by relating empirical measurements of the NTK to the superior performance of deep nonlinear orthogonal networks trained under full-batch gradient descent on the MNIST and CIFAR-10 classification tasks.  ( 2 min )
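    As a rough illustration of what "weights initialized from the ensemble of orthogonal matrices" can mean in practice, the sketch below samples a square orthogonal matrix by QR-decomposing a Gaussian matrix. The function name `orthogonal_init`, the width of 64, and the QR-based sampler are assumptions made for illustration; the abstract does not say how the authors sample their weights.

```python
import numpy as np


def orthogonal_init(n, rng):
    """Sample an n x n orthogonal matrix via QR of a Gaussian matrix (illustrative)."""
    a = rng.standard_normal((n, n))
    q, r = np.linalg.qr(a)
    # Multiply each column by the sign of the matching diagonal entry of r.
    # This standard correction makes the sample uniform over the orthogonal group.
    return q * np.sign(np.diag(r))


rng = np.random.default_rng(0)
W = orthogonal_init(64, rng)   # hypothetical layer width
x = rng.standard_normal(64)

# An orthogonal layer preserves the Euclidean norm of its input exactly,
# whereas a Gaussian layer only does so on average.
print(np.allclose(W.T @ W, np.eye(64)))
print(np.isclose(np.linalg.norm(W @ x), np.linalg.norm(x)))
```

    The exact norm preservation is only the linear part of the story; the depth-independence of fluctuations claimed in the abstract concerns tanh networks and is not demonstrated by this snippet.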
    LEMON: Lossless model expansion. (arXiv:2310.07999v1 [cs.LG])
    Scaling of deep neural networks, especially Transformers, is pivotal for their surging performance and has further led to the emergence of sophisticated reasoning capabilities in foundation models. Such scaling generally requires training large models from scratch with random initialization, failing to leverage the knowledge acquired by their smaller counterparts, which are already resource-intensive to obtain. To tackle this inefficiency, we present LosslEss MOdel ExpansioN (LEMON), a recipe to initialize scaled models using the weights of their smaller but pre-trained counterparts. This is followed by model training with an optimized learning rate scheduler tailored explicitly for the scaled models, substantially reducing the training time compared to training from scratch. Notably, LEMON is versatile, ensuring compatibility with various network structures, including models like Vision Transformers and BERT. Our empirical results demonstrate that LEMON reduces computational costs by 56.7% for Vision Transformers and 33.2% for BERT when compared to training from scratch.
    $L^1$ Estimation: On the Optimality of Linear Estimators. (arXiv:2309.09129v2 [math.ST] UPDATED)
    Consider the problem of estimating a random variable $X$ from noisy observations $Y = X+ Z$, where $Z$ is standard normal, under the $L^1$ fidelity criterion. It is well known that the optimal Bayesian estimator in this setting is the conditional median. This work shows that the only prior distribution on $X$ that induces linearity in the conditional median is Gaussian. Along the way, several other results are presented. In particular, it is demonstrated that if the conditional distribution $P_{X|Y=y}$ is symmetric for all $y$, then $X$ must follow a Gaussian distribution. Additionally, we consider other $L^p$ losses and observe the following phenomenon: for $p \in [1,2]$, Gaussian is the only prior distribution that induces a linear optimal Bayesian estimator, and for $p \in (2,\infty)$, infinitely many prior distributions on $X$ can induce linearity. Finally, extensions are provided to encompass noise models leading to conditional distributions from certain exponential families.
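    The statement that the conditional median is optimal under the $L^1$ criterion rests on the fact that a median minimizes expected absolute error. The sketch below checks this empirically on a skewed sample, and also records the Gaussian case, where the conditional median is linear in $y$. The exponential test distribution, the helper names, and the prior variance `s2` are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
# A skewed sample, so that the mean and the median differ noticeably.
samples = rng.exponential(scale=1.0, size=20001)


def mean_abs_error(c, xs):
    """Empirical L1 risk of the constant estimate c."""
    return np.mean(np.abs(xs - c))


mae_median = mean_abs_error(np.median(samples), samples)
mae_mean = mean_abs_error(np.mean(samples), samples)
print(mae_median, mae_mean)  # the median attains the smaller absolute error


def gaussian_conditional_median(y, s2):
    """Conditional median of X given Y = y when X ~ N(0, s2) and Z ~ N(0, 1).

    The posterior is Gaussian, so its median equals its mean, s2 / (s2 + 1) * y,
    which is linear in y.
    """
    return s2 / (s2 + 1.0) * y
```

    The paper's contribution is the converse direction (linearity forces a Gaussian prior), which a numerical check like this cannot establish.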
    A Symmetry-Aware Exploration of Bayesian Neural Network Posteriors. (arXiv:2310.08287v1 [stat.ML])
    The distribution of the weights of modern deep neural networks (DNNs) - crucial for uncertainty quantification and robustness - is an eminently complex object due to its extremely high dimensionality. This paper proposes one of the first large-scale explorations of the posterior distribution of deep Bayesian Neural Networks (BNNs), expanding its study to real-world vision tasks and architectures. Specifically, we investigate the optimal approach for approximating the posterior, analyze the connection between posterior quality and uncertainty quantification, delve into the impact of modes on the posterior, and explore methods for visualizing the posterior. Moreover, we uncover weight-space symmetries as a critical aspect for understanding the posterior. To this end, we develop an in-depth assessment of the impact of both permutation and scaling symmetries that tend to obfuscate the Bayesian posterior. While the first type of transformation is known for duplicating modes, we explore the relationship between the latter and L2 regularization, challenging previous misconceptions. Finally, to help the community improve our understanding of the Bayesian posterior, we will shortly release the first large-scale checkpoint dataset, including thousands of real-world models and our code.
    A general framework for multi-step ahead adaptive conformal heteroscedastic time series forecasting. (arXiv:2207.14219v9 [stat.ML] UPDATED)
    This paper introduces a novel model-agnostic algorithm called adaptive ensemble batch multi-input multi-output conformalized quantile regression (AEnbMIMOCQR) that enables forecasters to generate multi-step ahead prediction intervals for a fixed pre-specified miscoverage rate in a distribution-free manner. Our method is grounded on conformal prediction principles; however, it does not require data splitting and provides close to exact coverage even when the data is not exchangeable. Moreover, the resulting prediction intervals, besides being empirically valid along the forecast horizon, do not neglect heteroscedasticity. AEnbMIMOCQR is designed to be robust to distribution shifts, which means that its prediction intervals remain reliable over an unlimited period of time, without entailing retraining or imposing unrealistically strict assumptions on the data-generating process. Through methodical experimentation, we demonstrate that our approach outperforms other competitive methods on both real-world and synthetic datasets. The code used in the experimental part and a tutorial on how to use AEnbMIMOCQR can be found at the following GitHub repository: https://github.com/Quilograma/AEnbMIMOCQR.
    RandCom: Random Communication Skipping Method for Decentralized Stochastic Optimization. (arXiv:2310.07983v1 [cs.LG])
    Distributed optimization methods with random communication skips are gaining increasing attention due to their proven benefits in accelerating communication complexity. Nevertheless, existing research mainly focuses on centralized communication protocols for strongly convex deterministic settings. In this work, we provide a decentralized optimization method called RandCom, which incorporates probabilistic local updates. We analyze the performance of RandCom in stochastic non-convex, convex, and strongly convex settings and demonstrate its ability to asymptotically reduce communication overhead by the probability of communication. Additionally, we prove that RandCom achieves linear speedup as the number of nodes increases. In stochastic strongly convex settings, we further prove that RandCom can achieve linear speedup with network-independent stepsizes. Moreover, we apply RandCom to federated learning and provide positive results concerning the potential for achieving linear speedup and the suitability of the probabilistic local update approach for non-convex settings.  ( 2 min )
    Characterizing climate pathways using feature importance on echo state networks. (arXiv:2310.08495v1 [stat.ML])
    The 2022 National Defense Strategy of the United States listed climate change as a serious threat to national security. Climate intervention methods, such as stratospheric aerosol injection, have been proposed as mitigation strategies, but the downstream effects of such actions on a complex climate system are not well understood. The development of algorithmic techniques for quantifying relationships between source and impact variables related to a climate event (i.e., a climate pathway) would help inform policy decisions. Data-driven deep learning models have become powerful tools for modeling highly nonlinear relationships and may provide a route to characterize climate variable relationships. In this paper, we explore the use of an echo state network (ESN) for characterizing climate pathways. ESNs are a computationally efficient neural network variation designed for temporal data, and recent work proposes ESNs as a useful tool for forecasting spatio-temporal climate data. Like other neural networks, ESNs are non-interpretable black-box models, which poses a hurdle for understanding variable relationships. We address this issue by developing feature importance methods for ESNs in the context of spatio-temporal data to quantify variable relationships captured by the model. We conduct a simulation study to assess and compare the feature importance techniques, and we demonstrate the approach on reanalysis climate data. In the climate application, we select a time period that includes the 1991 volcanic eruption of Mount Pinatubo. This event was a significant stratospheric aerosol injection, which we use as a proxy for an artificial stratospheric aerosol injection. Using the proposed approach, we are able to characterize relationships between pathway variables associated with this event.  ( 3 min )
    Impact of multi-armed bandit strategies on deep recurrent reinforcement learning. (arXiv:2310.08331v1 [stat.ML])
    Incomplete knowledge of the environment leads an agent to make decisions under uncertainty. One of the major dilemmas in Reinforcement Learning (RL) is that an autonomous agent has to balance two contrasting needs in making its decisions: exploiting the current knowledge of the environment to maximize the cumulative reward, and exploring actions that improve the knowledge of the environment, hopefully leading to higher reward values (the exploration-exploitation trade-off). Concurrently, another relevant issue regards the full observability of the states, which may not be assumed in all applications, such as when only 2D images are considered as input in an RL approach used for finding the optimal action within a 3D simulation environment. In this work, we address these issues by deploying and testing several techniques to balance the exploration-exploitation trade-off on partially observable systems for predicting steering wheels in an autonomous driving scenario. More precisely, the final aim is to investigate the effects of using both stochastic and deterministic multi-armed bandit strategies coupled with a Deep Recurrent Q-Network. Additionally, we adapted and evaluated the impact of an innovative method to improve the learning phase of the underlying Convolutional Recurrent Neural Network. We aim to show that adaptive stochastic methods for exploration better approximate the trade-off between exploration and exploitation as, in general, Softmax and Max-Boltzmann strategies are able to outperform epsilon-greedy techniques.  ( 2 min )
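    The three exploration strategies named in the abstract can be sketched in their common textbook forms as below. This is an illustration only: the function names, the temperature parameter, and the toy Q-values are assumptions, and the paper may use different variants or schedules for epsilon and temperature.

```python
import numpy as np


def epsilon_greedy(q, eps, rng):
    """With probability eps pick a uniformly random action, else the greedy one."""
    if rng.random() < eps:
        return int(rng.integers(len(q)))
    return int(np.argmax(q))


def softmax_policy(q, temperature, rng):
    """Sample an action with probability proportional to exp(q / temperature)."""
    z = (q - np.max(q)) / temperature  # subtract the max for numerical stability
    p = np.exp(z)
    p /= p.sum()
    return int(rng.choice(len(q), p=p))


def max_boltzmann(q, eps, temperature, rng):
    """Greedy by default; with probability eps explore via the softmax policy."""
    if rng.random() < eps:
        return softmax_policy(q, temperature, rng)
    return int(np.argmax(q))


rng = np.random.default_rng(0)
q_values = np.array([0.1, 0.5, 2.0, 1.0])  # hypothetical Q-network outputs
actions = [max_boltzmann(q_values, eps=0.2, temperature=1.0, rng=rng) for _ in range(10)]
print(actions)
```

    Unlike epsilon-greedy, the two softmax-based rules explore in proportion to the estimated action values, so clearly poor actions are tried less often.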
    Clustering Three-Way Data with Outliers. (arXiv:2310.05288v2 [stat.ML] UPDATED)
    Matrix-variate distributions are a recent addition to the model-based clustering field, thereby making it possible to analyze data in matrix form with complex structure such as images and time series. Due to its recent appearance, there is limited literature on matrix-variate data, with even less on dealing with outliers in these models. An approach for clustering matrix-variate normal data with outliers is discussed. The approach, which uses the distribution of subset log-likelihoods, extends the OCLUST algorithm to matrix-variate normal data and uses an iterative approach to detect and trim outliers.  ( 2 min )
    Contextualized Policy Recovery: Modeling and Interpreting Medical Decisions with Adaptive Imitation Learning. (arXiv:2310.07918v1 [cs.LG])
    Interpretable policy learning seeks to estimate intelligible decision policies from observed actions; however, existing models fall short by forcing a tradeoff between accuracy and interpretability. This tradeoff limits data-driven interpretations of human decision-making processes. For example, to audit medical decisions for biases and suboptimal practices, we require models of decision processes which provide concise descriptions of complex behaviors. Fundamentally, existing approaches are burdened by this tradeoff because they represent the underlying decision process as a universal policy, when in fact human decisions are dynamic and can change drastically with contextual information. Thus, we propose Contextualized Policy Recovery (CPR), which re-frames the problem of modeling complex decision processes as a multi-task learning problem in which complex decision policies are composed of context-specific policies. CPR models each context-specific policy as a linear observation-to-action mapping, and generates new decision models on-demand as contexts are updated with new observations. CPR is compatible with fully offline and partially observable decision environments, and can be tailored to incorporate any recurrent black-box model or interpretable decision model. We assess CPR through studies on simulated and real data, achieving state-of-the-art performance on the canonical tasks of predicting antibiotic prescription in intensive care units ($+22\%$ AUROC vs. previous SOTA) and predicting MRI prescription for Alzheimer's patients ($+7.7\%$ AUROC vs. previous SOTA). With this improvement in predictive performance, CPR closes the accuracy gap between interpretable and black-box methods for policy learning, allowing high-resolution exploration and analysis of context-specific decision models.  ( 3 min )
    Understanding Sparse Feature Updates in Deep Networks using Iterative Linearisation. (arXiv:2211.12345v4 [cs.LG] UPDATED)
    Larger and deeper networks generalise well despite their increased capacity to overfit. Understanding why this happens is theoretically and practically important. One recent approach looks at the infinitely wide limits of such networks and their corresponding kernels. However, these theoretical tools cannot fully explain finite networks as the empirical kernel changes significantly during gradient-descent-based training in contrast to infinite networks. In this work, we derive an iterative linearised training method as a novel empirical tool to further investigate this distinction, allowing us to control for sparse (i.e. infrequent) feature updates and quantify the frequency of feature learning needed to achieve comparable performance. We justify iterative linearisation as an interpolation between a finite analog of the infinite width regime, which does not learn features, and standard gradient descent training, which does. Informally, we also show that it is analogous to a damped version of the Gauss-Newton algorithm -- a second-order method. We show that in a variety of cases, iterative linearised training surprisingly performs on par with standard training, noting in particular how much less frequent feature learning is required to achieve comparable performance. We also show that feature learning is essential for good performance. Since such feature learning inevitably causes changes in the NTK, we provide direct negative evidence for the NTK theory, which states that the NTK remains constant during training.  ( 3 min )
    Limits of Model Selection under Transfer Learning. (arXiv:2305.00152v4 [stat.ML] UPDATED)
    Theoretical studies on transfer learning or domain adaptation have so far focused on situations with a known hypothesis class or model; however in practice, some amount of model selection is usually involved, often appearing under the umbrella term of hyperparameter-tuning: for example, one may think of the problem of tuning for the right neural network architecture towards a target task, while leveraging data from a related source task. Now, in addition to the usual tradeoffs on approximation vs estimation errors involved in model selection, this problem brings in a new complexity term, namely, the transfer distance between source and target distributions, which is known to vary with the choice of hypothesis class. We present a first study of this problem, focusing on classification; in particular, the analysis reveals some remarkable phenomena: adaptive rates, i.e., those achievable with no distributional information, can be arbitrarily slower than oracle rates, i.e., when given knowledge on distances.  ( 2 min )
    A Theory of Non-Linear Feature Learning with One Gradient Step in Two-Layer Neural Networks. (arXiv:2310.07891v1 [stat.ML])
    Feature learning is thought to be one of the fundamental reasons for the success of deep neural networks. It is rigorously known that in two-layer fully-connected neural networks under certain conditions, one step of gradient descent on the first layer followed by ridge regression on the second layer can lead to feature learning, characterized by the appearance of a separated rank-one component (a spike) in the spectrum of the feature matrix. However, with a constant gradient descent step size, this spike only carries information from the linear component of the target function and therefore learning non-linear components is impossible. We show that with a learning rate that grows with the sample size, such training in fact introduces multiple rank-one components, each corresponding to a specific polynomial feature. We further prove that the limiting large-dimensional and large sample training and test errors of the updated neural networks are fully characterized by these spikes. By precisely analyzing the improvement in the loss, we demonstrate that these non-linear features can enhance learning.  ( 2 min )
    Log-Gaussian Gamma Processes for Training Bayesian Neural Networks in Raman and CARS Spectroscopies. (arXiv:2310.08055v1 [stat.AP])
    We propose an approach utilizing gamma-distributed random variables, coupled with log-Gaussian modeling, to generate synthetic datasets suitable for training neural networks. This addresses the challenge of limited real observations in various applications. We apply this methodology to both Raman and coherent anti-Stokes Raman scattering (CARS) spectra, using experimental spectra to estimate gamma process parameters. Parameter estimation is performed using Markov chain Monte Carlo methods, yielding a full Bayesian posterior distribution for the model which can be sampled for synthetic data generation. Additionally, we model the additive and multiplicative background functions for Raman and CARS with Gaussian processes. We train two Bayesian neural networks to estimate parameters of the gamma process which can then be used to estimate the underlying Raman spectrum and simultaneously provide uncertainty through the estimation of parameters of a probability distribution. We apply the trained Bayesian neural networks to experimental Raman spectra of phthalocyanine blue, aniline black, naphthol red, and red 264 pigments and also to experimental CARS spectra of adenosine phosphate, fructose, glucose, and sucrose. The results agree with deterministic point estimates for the underlying Raman and CARS spectral signatures.  ( 2 min )
    Learning to Act from Actionless Videos through Dense Correspondences. (arXiv:2310.08576v1 [cs.RO])
    In this work, we present an approach to construct a video-based robot policy capable of reliably executing diverse tasks across different robots and environments from few video demonstrations without using any action annotations. Our method leverages images as a task-agnostic representation, encoding both the state and action information, and text as a general representation for specifying robot goals. By synthesizing videos that "hallucinate" robots executing actions, in combination with dense correspondences between frames, our approach can infer the closed-form action to execute in an environment without the need for any explicit action labels. This unique capability allows us to train the policy solely based on RGB videos and deploy learned policies to various robotic tasks. We demonstrate the efficacy of our approach in learning policies on table-top manipulation and navigation tasks. Additionally, we contribute an open-source framework for efficient video modeling, enabling the training of high-fidelity policy models with four GPUs within a single day.  ( 2 min )
    A Complete Recipe for Diffusion Generative Models. (arXiv:2303.01748v2 [cs.LG] UPDATED)
    Score-based Generative Models (SGMs) have demonstrated exceptional synthesis outcomes across various tasks. However, the current design landscape of the forward diffusion process remains largely untapped and often relies on physical heuristics or simplifying assumptions. Utilizing insights from the development of scalable Bayesian posterior samplers, we present a complete recipe for formulating forward processes in SGMs, ensuring convergence to the desired target distribution. Our approach reveals that several existing SGMs can be seen as specific manifestations of our framework. Building upon this method, we introduce Phase Space Langevin Diffusion (PSLD), which relies on score-based modeling within an augmented space enriched by auxiliary variables akin to physical phase space. Empirical results exhibit the superior sample quality and improved speed-quality trade-off of PSLD compared to various competing approaches on established image synthesis benchmarks. Remarkably, PSLD achieves sample quality akin to state-of-the-art SGMs (FID: 2.10 for unconditional CIFAR-10 generation). Lastly, we demonstrate the applicability of PSLD in conditional synthesis using pre-trained score networks, offering an appealing alternative as an SGM backbone for future advancements. Code and model checkpoints can be accessed at https://github.com/mandt-lab/PSLD.  ( 2 min )
    Smoothed $f$-Divergence Distributionally Robust Optimization. (arXiv:2306.14041v2 [math.OC] UPDATED)
    In data-driven optimization, sample average approximation (SAA) is known to suffer from the so-called optimizer's curse that causes an over-optimistic evaluation of the solution performance. We argue that a special type of distributionally robust optimization (DRO) formulation offers theoretical advantages in correcting for this optimizer's curse compared to simple "margin" adjustments to SAA and other DRO approaches: it attains a statistical bound on the out-of-sample performance, for a wide class of objective functions and distributions, that is nearly tightest in terms of exponential decay rate. This DRO uses an ambiguity set based on a Kullback-Leibler (KL) divergence smoothed by the Wasserstein or L\'evy-Prokhorov (LP) distance via a suitable distance optimization. Computationally, we also show that such a DRO, and its generalized versions using smoothed $f$-divergence, are not harder than DRO problems based on $f$-divergence or Wasserstein distances, rendering our DRO formulations both statistically optimal and computationally viable.  ( 2 min )
    On Regularized Sparse Logistic Regression. (arXiv:2309.05925v2 [cs.LG] UPDATED)
    Sparse logistic regression performs classification and feature selection simultaneously. Although many studies have addressed $\ell_1$-regularized logistic regression, there is no equivalently abundant work on solving sparse logistic regression with a nonconvex regularization term. In this paper, we propose a unified framework to solve $\ell_1$-regularized logistic regression, which can be naturally extended to a nonconvex regularization term, as long as a certain requirement is satisfied. In addition, we also utilize a different line search criterion to guarantee monotone convergence for various regularization terms. Empirical experiments on binary classification tasks with real-world datasets demonstrate that our proposed algorithms are capable of performing classification and feature selection effectively at a lower computational cost.  ( 2 min )
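    For readers unfamiliar with how an $\ell_1$ penalty produces feature selection, the sketch below uses plain proximal gradient descent (ISTA) with soft-thresholding. This is a standard baseline technique, not the unified framework or the line search proposed in the paper; the fixed step size, the synthetic data, and all function names are assumptions.

```python
import numpy as np


def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1: shrink each entry toward zero by tau."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)


def l1_logistic_ista(X, y, lam, step, iters=500):
    """Minimize mean logistic loss + lam * ||w||_1 by proximal gradient steps.

    X: (n, d) features; y: (n,) labels in {0, 1}; step: fixed step size (assumed).
    """
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))        # predicted probabilities
        grad = X.T @ (p - y) / len(y)             # gradient of the smooth part
        w = soft_threshold(w - step * grad, step * lam)
    return w


rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
w_true = np.zeros(10)
w_true[:2] = [2.0, -3.0]                          # only two informative features
y = (rng.random(200) < 1.0 / (1.0 + np.exp(-(X @ w_true)))).astype(float)

w_hat = l1_logistic_ista(X, y, lam=0.05, step=0.5)
print(np.round(w_hat, 2))                         # weak coordinates tend to be exactly zero
```

    The soft-thresholding step is what sets small coefficients to exactly zero; nonconvex penalties replace it with a different proximal map.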
    When, Why and How Much? Adaptive Learning Rate Scheduling by Refinement. (arXiv:2310.07831v1 [cs.LG])
    Learning rate schedules used in practice bear little resemblance to those recommended by theory. We close much of this theory/practice gap, and as a consequence are able to derive new problem-adaptive learning rate schedules. Our key technical contribution is a refined analysis of learning rate schedules for a wide class of optimization algorithms (including SGD). In contrast to most prior works that study the convergence of the average iterate, we study the last iterate, which is what most people use in practice. When considering only worst-case analysis, our theory predicts that the best choice is the linear decay schedule: a popular choice in practice that sets the stepsize proportionally to $1 - t/T$, where $t$ is the current iteration and $T$ is the total number of steps. To go beyond this worst-case analysis, we use the observed gradient norms to derive schedules refined for any particular task. These refined schedules exhibit learning rate warm-up and rapid learning rate annealing near the end of training. Ours is the first systematic approach to automatically yield both of these properties. We perform the most comprehensive evaluation of learning rate schedules to date, evaluating across 10 diverse deep learning problems, a series of LLMs, and a suite of logistic regression problems. We validate that overall, the linear-decay schedule matches or outperforms all commonly used default schedules including cosine annealing, and that our schedule refinement method gives further improvements.  ( 3 min )
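    The linear decay schedule described above, with stepsize proportional to $1 - t/T$, can be written in a few lines. The function name, the base learning rate, and the step count below are illustrative assumptions; the gradient-norm-based refinement is not specified in the abstract and is not attempted here.

```python
def linear_decay_lr(t, total_steps, base_lr):
    """Stepsize proportional to 1 - t/T, where t is the current iteration."""
    return base_lr * (1.0 - t / total_steps)


# Hypothetical run of T = 10 steps with a base learning rate of 0.1.
schedule = [linear_decay_lr(t, total_steps=10, base_lr=0.1) for t in range(10)]
print(schedule)
```

    The schedule starts at the base rate and shrinks by the same amount every step, reaching zero at $t = T$.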
    Differentially Private Non-convex Learning for Multi-layer Neural Networks. (arXiv:2310.08425v1 [cs.LG])
    This paper focuses on the problem of Differentially Private Stochastic Optimization for (multi-layer) fully connected neural networks with a single output node. In the first part, we examine cases with no hidden nodes, specifically focusing on Generalized Linear Models (GLMs). We investigate the well-specified model where the random noise possesses a zero mean, and the link function is both bounded and Lipschitz continuous. We propose several algorithms and our analysis demonstrates the feasibility of achieving an excess population risk that remains invariant to the data dimension. We also delve into the scenario involving the ReLU link function, and our findings mirror those of the bounded link function. We conclude this section by contrasting well-specified and misspecified models, using ReLU regression as a representative example. In the second part of the paper, we extend our ideas to two-layer neural networks with sigmoid or ReLU activation functions in the well-specified model. In the third part, we study the theoretical guarantees of DP-SGD in Abadi et al. (2016) for fully connected multi-layer neural networks. By utilizing recent advances in Neural Tangent Kernel theory, we provide the first excess population risk bound when both the sample size and the width of the network are sufficiently large. Additionally, we discuss the role of some parameters in DP-SGD regarding their utility, both theoretically and empirically.  ( 2 min )
    Conditional Sig-Wasserstein GANs for Time Series Generation. (arXiv:2006.05421v2 [cs.LG] UPDATED)
    Generative adversarial networks (GANs) have been extremely successful in generating samples from seemingly high dimensional probability measures. However, these methods struggle to capture the temporal dependence of joint probability distributions induced by time-series data. Furthermore, long time-series data streams hugely increase the dimension of the target space, which may render generative modelling infeasible. To overcome these challenges, motivated by the autoregressive models in econometrics, we are interested in the conditional distribution of future time series given the past information. We propose the generic conditional Sig-WGAN framework by integrating Wasserstein-GANs (WGANs) with a mathematically principled and efficient path feature extraction called the signature of a path. The signature of a path is a graded sequence of statistics that provides a universal description for a stream of data, and its expected value characterises the law of the time-series model. In particular, we develop the conditional Sig-$W_1$ metric, which captures the conditional joint law of time series models, and use it as a discriminator. The signature feature space enables the explicit representation of the proposed discriminators, which alleviates the need for expensive training. We validate our method on both synthetic and empirical datasets and observe that our method consistently and significantly outperforms state-of-the-art benchmarks with respect to measures of similarity and predictive ability.  ( 3 min )
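    To make "the signature of a path" more concrete, the sketch below computes the first two signature levels of a piecewise-linear path: level 1 is the total increment, and level 2 collects the iterated integrals of one coordinate against another. It is a minimal illustration truncated at depth 2; the function name and the toy path are assumptions, and a real pipeline would more likely rely on a dedicated signature library.

```python
import numpy as np


def signature_depth2(path):
    """First two signature levels of a piecewise-linear path of shape (n_points, d)."""
    path = np.asarray(path, dtype=float)
    increments = np.diff(path, axis=0)
    d = path.shape[1]
    level1 = increments.sum(axis=0)
    level2 = np.zeros((d, d))
    running = np.zeros(d)  # displacement from the start before the current segment
    for dx in increments:
        # Exact integral of (X_t - X_0) ⊗ dX_t over one linear segment.
        level2 += np.outer(running, dx) + 0.5 * np.outer(dx, dx)
        running += dx
    return level1, level2


# Toy path: one unit step right, then one unit step up.
s1, s2 = signature_depth2([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]])
levy_area = 0.5 * (s2[0, 1] - s2[1, 0])  # signed area between the path and its chord
print(s1, s2, levy_area)
```

    The antisymmetric part of level 2 records the order in which coordinates move, which is information that the total increment alone cannot capture.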
    An interpretable neural network-based non-proportional odds model for ordinal regression. (arXiv:2303.17823v3 [stat.ME] UPDATED)
    This study proposes an interpretable neural network-based non-proportional odds model (N$^3$POM) for ordinal regression. N$^3$POM is different from conventional approaches to ordinal regression with non-proportional models in several ways: (1) N$^3$POM is designed to directly handle continuous responses, whereas standard methods typically treat de facto ordered continuous variables as discrete, (2) instead of estimating response-dependent finite coefficients of linear models from discrete responses as is done in conventional approaches, we train a non-linear neural network to serve as a coefficient function. Thanks to the neural network, N$^3$POM offers flexibility while preserving the interpretability of conventional ordinal regression. We establish a sufficient condition under which the predicted conditional cumulative probability locally satisfies the monotonicity constraint over a user-specified region in the covariate space. Additionally, we provide a monotonicity-preserving stochastic (MPS) algorithm for effectively training the neural network. We apply N$^3$POM to several real-world datasets.  ( 2 min )
    Generalization bounds for neural ordinary differential equations and deep residual networks. (arXiv:2305.06648v2 [stat.ML] UPDATED)
    Neural ordinary differential equations (neural ODEs) are a popular family of continuous-depth deep learning models. In this work, we consider a large family of parameterized ODEs with continuous-in-time parameters, which include time-dependent neural ODEs. We derive a generalization bound for this class by a Lipschitz-based argument. By leveraging the analogy between neural ODEs and deep residual networks, our approach yields in particular a generalization bound for a class of deep residual networks. The bound involves the magnitude of the difference between successive weight matrices. We illustrate numerically how this quantity affects the generalization capability of neural networks.  ( 2 min )
    NECO: NEural Collapse Based Out-of-distribution detection. (arXiv:2310.06823v2 [stat.ML] UPDATED)
    Detecting out-of-distribution (OOD) data is a critical challenge in machine learning because models tend to be overconfident, often without awareness of their epistemological limits. We hypothesize that "neural collapse", a phenomenon affecting in-distribution data for models trained beyond loss convergence, also influences OOD data. To benefit from this interplay, we introduce NECO, a novel post-hoc method for OOD detection, which leverages the geometric properties of "neural collapse" and of principal component spaces to identify OOD data. Our extensive experiments demonstrate that NECO achieves state-of-the-art results on both small and large-scale OOD detection tasks while exhibiting strong generalization capabilities across different network architectures. Furthermore, we provide a theoretical explanation for the effectiveness of our method in OOD detection. We plan to release the code after the anonymity period.  ( 2 min )
    Variational Imbalanced Regression: Fair Uncertainty Quantification via Probabilistic Smoothing. (arXiv:2306.06599v4 [cs.LG] UPDATED)
    Existing regression models tend to fall short in both accuracy and uncertainty estimation when the label distribution is imbalanced. In this paper, we propose a probabilistic deep learning model, dubbed variational imbalanced regression (VIR), which not only performs well in imbalanced regression but naturally produces reasonable uncertainty estimation as a byproduct. Different from typical variational autoencoders assuming I.I.D. representations (a data point's representation is not directly affected by other data points), our VIR borrows data with similar regression labels to compute the latent representation's variational distribution; furthermore, different from deterministic regression models producing point estimates, VIR predicts the entire normal-inverse-gamma distributions and modulates the associated conjugate distributions to impose probabilistic reweighting on the imbalanced data, thereby providing better uncertainty estimation. Experiments in several real-world datasets show that our VIR can outperform state-of-the-art imbalanced regression models in terms of both accuracy and uncertainty estimation. Code will soon be available at https://github.com/Wang-ML-Lab/variational-imbalanced-regression.  ( 2 min )
    Hyperparameter Adaptive Search for Surrogate Optimization: A Self-Adjusting Approach. (arXiv:2310.07970v1 [cs.LG])
    Surrogate Optimization (SO) algorithms have shown promise for optimizing expensive black-box functions. However, their performance is heavily influenced by hyperparameters related to sampling and surrogate fitting, which poses a challenge to their widespread adoption. We investigate the impact of hyperparameters on various SO algorithms and propose a Hyperparameter Adaptive Search for SO (HASSO) approach. HASSO is not a hyperparameter tuning algorithm, but a generic self-adjusting SO algorithm that dynamically tunes its own hyperparameters while concurrently optimizing the primary objective function, without requiring additional evaluations. The aim is to improve the accessibility, effectiveness, and convergence speed of SO algorithms for practitioners. Our approach identifies and modifies the most influential hyperparameters specific to each problem and SO approach, reducing the need for manual tuning without significantly increasing the computational burden. Experimental results demonstrate the effectiveness of HASSO in enhancing the performance of various SO algorithms across different global optimization test problems.  ( 2 min )
    Robust 1-bit Compressed Sensing with Iterative Hard Thresholding. (arXiv:2310.08019v1 [cs.IT])
    In 1-bit compressed sensing, the aim is to estimate a $k$-sparse unit vector $x\in S^{n-1}$ within an $\epsilon$ error (in $\ell_2$) from a minimal number of linear measurements that are quantized to just their signs, i.e., from measurements of the form $y = \mathrm{Sign}(\langle a, x\rangle).$ In this paper, we study a noisy version where a fraction of the measurements can be flipped, potentially by an adversary. In particular, we analyze the Binary Iterative Hard Thresholding (BIHT) algorithm, a proximal gradient descent on a properly defined loss function used for 1-bit compressed sensing, in this noisy setting. It is known from recent results that, with $\tilde{O}(\frac{k}{\epsilon})$ noiseless measurements, BIHT provides an estimate within $\epsilon$ error. This result is optimal and universal, meaning one set of measurements works for all sparse vectors. In this paper, we show that BIHT also provides better results than all known methods for the noisy setting. We show that when up to $\tau$-fraction of the sign measurements are incorrect (adversarial error), with the same number of measurements as before, BIHT agnostically provides an estimate of $x$ within an $\tilde{O}(\epsilon+\tau)$ error, maintaining the universality of measurements. This establishes stability of iterative hard thresholding in the presence of measurement error. To obtain the result, we use the restricted approximate invertibility of Gaussian matrices, as well as a tight analysis of the high-dimensional geometry of the adversarially corrupted measurements.  ( 3 min )
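    The BIHT iteration described above can be sketched in a few lines. This is a generic normalized variant, not necessarily the exact algorithm analyzed in the paper: the step size `1/m`, the iteration count, and the helper names `hard_threshold`, `biht`, and `demo` are assumptions made for illustration, and the demo flips signs at random rather than adversarially.

```python
import numpy as np


def hard_threshold(v, k):
    """Keep the k largest-magnitude entries of v; zero out the rest."""
    out = np.zeros_like(v)
    keep = np.argpartition(np.abs(v), -k)[-k:]
    out[keep] = v[keep]
    return out


def biht(A, y, k, iters=50):
    """Generic normalized BIHT sketch (step size 1/m is an assumption)."""
    m, n = A.shape
    x = np.zeros(n)
    for _ in range(iters):
        # Gradient-style step that pushes sign(A x) towards the observed signs y.
        z = x + (1.0 / m) * (A.T @ (y - np.sign(A @ x)))
        # Project back onto k-sparse vectors, then onto the unit sphere.
        x = hard_threshold(z, k)
        norm = np.linalg.norm(x)
        if norm > 0:
            x /= norm
    return x


def demo(m=2000, n=50, k=3, tau=0.02, seed=0):
    """Return the l2 error of BIHT when a tau-fraction of signs is flipped."""
    rng = np.random.default_rng(seed)
    x_true = np.zeros(n)
    x_true[rng.choice(n, size=k, replace=False)] = rng.standard_normal(k)
    x_true /= np.linalg.norm(x_true)
    A = rng.standard_normal((m, n))           # Gaussian measurement matrix
    y = np.sign(A @ x_true)                   # 1-bit measurements
    flip = rng.choice(m, size=int(tau * m), replace=False)
    y[flip] *= -1                             # random (not adversarial) sign flips
    return np.linalg.norm(biht(A, y, k) - x_true)
```

    Calling `demo()` returns the reconstruction error for one random instance; varying `tau` gives a rough feel for how the error grows with the fraction of flipped measurements.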
    Efficient probabilistic reconciliation of forecasts for real-valued and count time series. (arXiv:2210.02286v3 [stat.ML] UPDATED)
    Hierarchical time series are common in several applied fields. The forecasts for these time series are required to be coherent, that is, to satisfy the constraints given by the hierarchy. The most popular technique to enforce coherence is called reconciliation, which adjusts the base forecasts computed for each time series. However, recent works on probabilistic reconciliation present several limitations. In this paper, we propose a new approach based on conditioning to reconcile any type of forecast distribution. We then introduce a new algorithm, called Bottom-Up Importance Sampling, to efficiently sample from the reconciled distribution. It can be used for any base forecast distribution: discrete, continuous, or in the form of samples, providing a major speedup compared to the current methods. Experiments on several temporal hierarchies show a significant improvement over base probabilistic forecasts.  ( 2 min )
    Memorization with neural nets: going beyond the worst case. (arXiv:2310.00327v2 [stat.ML] UPDATED)
    In practice, deep neural networks are often able to easily interpolate their training data. To understand this phenomenon, many works have aimed to quantify the memorization capacity of a neural network architecture: the largest number of points such that the architecture can interpolate any placement of these points with any assignment of labels. For real-world data, however, one intuitively expects the presence of a benign structure so that interpolation already occurs at a smaller network size than suggested by memorization capacity. In this paper, we investigate interpolation by adopting an instance-specific viewpoint. We introduce a simple randomized algorithm that, given a fixed finite dataset with two classes, with high probability constructs an interpolating three-layer neural network in polynomial time. The required number of parameters is linked to geometric properties of the two classes and their mutual arrangement. As a result, we obtain guarantees that are independent of the number of samples and hence move beyond worst-case memorization capacity bounds. We illustrate the effectiveness of the algorithm in non-pathological situations with extensive numerical experiments and link the insights back to the theoretical results.  ( 2 min )
    Model-Agnostic Covariate-Assisted Inference on Partially Identified Causal Effects. (arXiv:2310.08115v1 [econ.EM])
    Many causal estimands are only partially identifiable since they depend on the unobservable joint distribution between potential outcomes. Stratification on pretreatment covariates can yield sharper partial identification bounds; however, unless the covariates are discrete with relatively small support, this approach typically requires consistent estimation of the conditional distributions of the potential outcomes given the covariates. Thus, existing approaches may fail under model misspecification or if consistency assumptions are violated. In this study, we propose a unified and model-agnostic inferential approach for a wide class of partially identified estimands, based on duality theory for optimal transport problems. In randomized experiments, our approach can wrap around any estimates of the conditional distributions and provide uniformly valid inference, even if the initial estimates are arbitrarily inaccurate. Also, our approach is doubly robust in observational studies. Notably, this property allows analysts to use the multiplier bootstrap to select covariates and models without sacrificing validity even if the true model is not included. Furthermore, if the conditional distributions are estimated at semiparametric rates, our approach matches the performance of an oracle with perfect knowledge of the outcome model. Finally, we propose an efficient computational framework, enabling implementation on many practical problems in causal inference.  ( 2 min )
    Generative modeling of time-dependent densities via optimal transport and projection pursuit. (arXiv:2304.09663v2 [stat.ML] UPDATED)
    Motivated by the computational difficulties incurred by popular deep learning algorithms for the generative modeling of temporal densities, we propose a cheap alternative which requires minimal hyperparameter tuning and scales favorably to high dimensional problems. In particular, we use a projection-based optimal transport solver [Meng et al., 2019] to join successive samples and subsequently use transport splines [Chewi et al., 2020] to interpolate the evolving density. When the sampling frequency is sufficiently high, the optimal maps are close to the identity and are thus computationally efficient to compute. Moreover, the training process is highly parallelizable as all optimal maps are independent and can thus be learned simultaneously. Finally, the approach is based solely on numerical linear algebra rather than minimizing a nonconvex objective function, allowing us to easily analyze and control the algorithm. We present several numerical experiments on both synthetic and real-world datasets to demonstrate the efficiency of our method. In particular, these experiments show that the proposed approach is highly competitive compared with state-of-the-art normalizing flows conditioned on time across a wide range of dimensionalities.  ( 3 min )
    Online RL in Linearly $q^\pi$-Realizable MDPs Is as Easy as in Linear MDPs If You Learn What to Ignore. (arXiv:2310.07811v1 [cs.LG])
    We consider online reinforcement learning (RL) in episodic Markov decision processes (MDPs) under the linear $q^\pi$-realizability assumption, where it is assumed that the action-values of all policies can be expressed as linear functions of state-action features. This class is known to be more general than linear MDPs, where the transition kernel and the reward function are assumed to be linear functions of the feature vectors. As our first contribution, we show that the difference between the two classes is the presence of states in linearly $q^\pi$-realizable MDPs where for any policy, all the actions have approximately equal values, and skipping over these states by following an arbitrarily fixed policy in those states transforms the problem to a linear MDP. Based on this observation, we derive a novel (computationally inefficient) learning algorithm for linearly $q^\pi$-realizable MDPs that simultaneously learns what states should be skipped over and runs another learning algorithm on the linear MDP hidden in the problem. The method returns an $\epsilon$-optimal policy after $\text{polylog}(H, d)/\epsilon^2$ interactions with the MDP, where $H$ is the time horizon and $d$ is the dimension of the feature vectors, giving the first polynomial-sample-complexity online RL algorithm for this setting. The results are proved for the misspecified case, where the sample complexity is shown to degrade gracefully with the misspecification error.  ( 3 min )
    Conformal inference for regression on Riemannian Manifolds. (arXiv:2310.08209v1 [stat.ML])
    Regression on manifolds, and, more broadly, statistics on manifolds, has garnered significant importance in recent years due to the vast number of applications for this type of data. Circular data is a classic example, but so is data in the space of covariance matrices, data on the Grassmannian manifold obtained as a result of principal component analysis, among many others. In this work we investigate prediction sets for regression scenarios when the response variable, denoted by $Y$, resides in a manifold, and the covariable, denoted by $X$, lies in Euclidean space. This extends the concepts delineated in [Lei and Wasserman, 2014] to this novel context. Aligning with traditional principles in conformal inference, these prediction sets are distribution-free, indicating that no specific assumptions are imposed on the joint distribution of $(X, Y)$, and they maintain a non-parametric character. We prove the asymptotic almost sure convergence of the empirical version of these regions on the manifold to their population counterparts. The efficiency of this method is shown through a comprehensive simulation study and an analysis involving real-world data.  ( 2 min )
    On Extreme Value Asymptotics of Projected Sample Covariances in High Dimensions with Applications in Finance and Convolutional Networks. (arXiv:2310.08150v1 [math.ST])
    Maximum-type statistics of certain functions of the sample covariance matrix of high-dimensional vector time series are studied to statistically confirm or reject the null hypothesis that a data set has been collected under normal conditions. The approach generalizes the case of the maximal deviation of the sample autocovariance function from its assumed values. Within a linear time series framework it is shown that Gumbel-type extreme value asymptotics holds true. As applications we discuss long-only minimal-variance portfolio optimization and subportfolio analysis with respect to idiosyncratic risks, ETF index tracking by sparse tracking portfolios, convolutional deep learners for image analysis and the analysis of array-of-sensors data.  ( 2 min )
    Local Graph Clustering with Noisy Labels. (arXiv:2310.08031v1 [cs.LG])
    The growing interest in machine learning problems over graphs with additional node information such as texts, images, or labels has popularized methods that require the costly operation of processing the entire graph. Yet, little effort has been devoted to developing fast local methods (i.e., methods that do not access the entire graph) that extract useful information from such data. To that end, we propose a study of local graph clustering using noisy node labels as a proxy for additional node information. In this setting, nodes receive initial binary labels based on cluster affiliation: 1 if they belong to the target cluster and 0 otherwise. Subsequently, a fraction of these labels is flipped. We investigate the benefits of incorporating noisy labels for local graph clustering. By constructing a weighted graph with such labels, we study the performance of a graph diffusion-based local clustering method on both the original and the weighted graphs. From a theoretical perspective, we consider recovering an unknown target cluster with a single seed node in a random graph with independent noisy node labels. We provide sufficient conditions on the label noise under which, with high probability, using diffusion in the weighted graph yields a more accurate recovery of the target cluster. This approach proves more effective than using the given labels alone or using diffusion in the label-free original graph. Empirically, we show that reliable node labels can be obtained with just a few samples from an attributed graph. Moreover, utilizing these labels via diffusion in the weighted graph leads to significantly better local clustering performance across several real-world datasets, improving F1 scores by up to 13%.  ( 3 min )
    Evaluation of ChatGPT-Generated Medical Responses: A Systematic Review and Meta-Analysis. (arXiv:2310.08410v1 [stat.ME])
    Large language models such as ChatGPT are increasingly explored in medical domains. However, the absence of standard guidelines for performance evaluation has led to methodological inconsistencies. This study aims to summarize the available evidence on evaluating ChatGPT's performance in medicine and provide direction for future research. We searched ten medical literature databases on June 15, 2023, using the keyword "ChatGPT". A total of 3520 articles were identified, of which 60 were reviewed and summarized in this paper and 17 were included in the meta-analysis. The analysis showed that ChatGPT displayed an overall integrated accuracy of 56% (95% CI: 51%-60%, I2 = 87%) in addressing medical queries. However, the studies varied in question resource, question-asking process, and evaluation metrics. Moreover, many studies failed to report methodological details, including the version of ChatGPT and whether each question was used independently or repeatedly. Our findings revealed that although ChatGPT demonstrated considerable potential for application in healthcare, the heterogeneity of the studies and insufficient reporting may affect the reliability of these results. Further well-designed studies with comprehensive and transparent reporting are needed to evaluate ChatGPT's performance in medicine.  ( 2 min )
    Extensions of Heterogeneity in Integration and Prediction (HIP) with R Shiny Application. (arXiv:2310.08426v1 [stat.ME])
    Multiple data views measured on the same set of participants are becoming more common and have the potential to deepen our understanding of many complex diseases when these different views are analyzed simultaneously. Equally important, many of these complex diseases show evidence of subgroup heterogeneity (e.g., by sex or race). HIP (Heterogeneity in Integration and Prediction) is among the first methods proposed to integrate multiple data views while also accounting for subgroup heterogeneity to identify common and subgroup-specific markers of a particular disease. However, HIP is only applicable to continuous outcomes and requires programming expertise by the user. Here we propose extensions to HIP that accommodate multi-class, Poisson, and Zero-Inflated Poisson outcomes while retaining the benefits of HIP. Additionally, we introduce an R Shiny application, accessible on shinyapps.io at https://multi-viewlearn.shinyapps.io/HIP_ShinyApp/, that provides an interface with the Python implementation of HIP to allow more researchers to use the method anywhere and on any device. We applied HIP to identify genes and proteins common and specific to males and females that are associated with exacerbation frequency. Although some of the identified genes and proteins show evidence of a relationship with chronic obstructive pulmonary disease (COPD) in existing literature, others may be candidates for future research investigating their relationship with COPD. We demonstrate the use of the Shiny application with publicly available data. An R package for HIP will be made available at https://github.com/lasandrall/HIP.  ( 3 min )
    Lion Secretly Solves Constrained Optimization: As Lyapunov Predicts. (arXiv:2310.05898v2 [cs.LG] UPDATED)
    Lion (Evolved Sign Momentum), a new optimizer discovered through program search, has shown promising results in training large AI models. It performs comparably or favorably to AdamW but with greater memory efficiency. As we can expect from the results of a random search program, Lion incorporates elements from several existing algorithms, including signed momentum, decoupled weight decay, and Polyak and Nesterov momentum, but does not fit into any existing category of theoretically grounded optimizers. Thus, even though Lion appears to perform well as a general-purpose optimizer for a wide range of tasks, its theoretical basis remains uncertain. This lack of theoretical clarity limits opportunities to further enhance and expand Lion's efficacy. This work aims to demystify Lion. Based on both continuous-time and discrete-time analysis, we demonstrate that Lion is a theoretically novel and principled approach for minimizing a general loss function $f(x)$ while enforcing a bound constraint $\|x\|_\infty \leq 1/\lambda$. Lion achieves this through the incorporation of decoupled weight decay, where $\lambda$ represents the weight decay coefficient. Our analysis is made possible by the development of a new Lyapunov function for the Lion updates. It applies to a broader family of Lion-$\kappa$ algorithms, where the $\text{sign}(\cdot)$ operator in Lion is replaced by the subgradient of a convex function $\kappa$, leading to the solution of a general composite optimization problem of $\min_x f(x) + \kappa^*(x)$. Our findings provide valuable insights into the dynamics of Lion and pave the way for further improvements and extensions of Lion-related algorithms.  ( 3 min )
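    The bound constraint $\|x\|_\infty \leq 1/\lambda$ mentioned above can be observed on a toy problem. The sketch below uses the commonly stated Lion update (sign of an interpolated momentum plus decoupled weight decay); the quadratic objective, the hyperparameter values, and the function name `lion_minimize` are assumptions chosen for illustration, not taken from the paper.

```python
import numpy as np


def lion_minimize(grad, x0, lr=0.01, beta1=0.9, beta2=0.99, lam=1.0, steps=2000):
    """Run the Lion update on a deterministic gradient oracle `grad`."""
    x = np.asarray(x0, dtype=float).copy()
    m = np.zeros_like(x)
    for _ in range(steps):
        g = grad(x)
        c = beta1 * m + (1.0 - beta1) * g       # interpolated momentum
        x = x - lr * (np.sign(c) + lam * x)     # sign step + decoupled weight decay
        m = beta2 * m + (1.0 - beta2) * g       # momentum update
    return x


# Toy objective f(x) = 0.5 * ||x - target||^2, whose unconstrained minimizer
# lies outside the box ||x||_inf <= 1/lam in the first two coordinates.
target = np.array([5.0, -5.0, 0.1])
x_final = lion_minimize(lambda x: x - target, x0=np.zeros(3), lam=1.0)
```

    With `lam=1.0` the iterates never leave the box $[-1, 1]^3$: the first two coordinates settle at the boundary instead of reaching $\pm 5$, while the third stays near its interior optimum.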
    Transformers as Decision Makers: Provable In-Context Reinforcement Learning via Supervised Pretraining. (arXiv:2310.08566v1 [cs.LG])
    Large transformer models pretrained on offline reinforcement learning datasets have demonstrated remarkable in-context reinforcement learning (ICRL) capabilities, where they can make good decisions when prompted with interaction trajectories from unseen environments. However, when and how transformers can be trained to perform ICRL have not been theoretically well-understood. In particular, it is unclear which reinforcement-learning algorithms transformers can perform in context, and how distribution mismatch in offline training data affects the learned algorithms. This paper provides a theoretical framework that analyzes supervised pretraining for ICRL. This includes two recently proposed training methods -- algorithm distillation and decision-pretrained transformers. First, assuming model realizability, we prove the supervised-pretrained transformer will imitate the conditional expectation of the expert algorithm given the observed trajectory. The generalization error will scale with model capacity and a distribution divergence factor between the expert and offline algorithms. Second, we show transformers with ReLU attention can efficiently approximate near-optimal online reinforcement learning algorithms like LinUCB and Thompson sampling for stochastic linear bandits, and UCB-VI for tabular Markov decision processes. This provides the first quantitative analysis of the ICRL capabilities of transformers pretrained from offline trajectories.  ( 2 min )
    Variable Selection for Kernel Two-Sample Tests. (arXiv:2302.07415v3 [stat.ML] UPDATED)
    We consider the variable selection problem for two-sample tests, aiming to select the most informative variables to distinguish samples from two groups. To solve this problem, we propose a framework based on the kernel maximum mean discrepancy (MMD). Our approach seeks a group of variables with a pre-specified size that maximizes the variance-regularized MMD statistics. This formulation also corresponds to the minimization of asymptotic type-II error while controlling type-I error, as studied in the literature. We present mixed-integer programming formulations and develop exact and approximation algorithms with performance guarantees for different choices of kernel functions. Furthermore, we provide a statistical testing power analysis of our proposed framework. Experiment results on synthetic and real datasets demonstrate the superior performance of our approach.  ( 2 min )
    Lattice real-time simulations with learned optimal kernels. (arXiv:2310.08053v1 [hep-lat])
    We present a simulation strategy for the real-time dynamics of quantum fields, inspired by reinforcement learning. It builds on the complex Langevin approach, which it amends with system specific prior information, a necessary prerequisite to overcome this exceptionally severe sign problem. The optimization process underlying our machine learning approach is made possible by deploying inherently stable solvers of the complex Langevin stochastic process and a novel optimality criterion derived from insight into so-called boundary terms. This conceptual and technical progress allows us to both significantly extend the range of real-time simulations in 1+1d scalar field theory beyond the state-of-the-art and to avoid discretization artifacts that plagued previous real-time field theory simulations. Limitations of the approach and promising future directions are discussed.  ( 2 min )
    On the Computational Complexity of Private High-dimensional Model Selection via the Exponential Mechanism. (arXiv:2310.07852v1 [stat.ML])
    We consider the problem of model selection in a high-dimensional sparse linear regression model under the differential privacy framework. In particular, we consider the problem of differentially private best subset selection and study its utility guarantee. We adopt the well-known exponential mechanism for selecting the best model, and under a certain margin condition, we establish its strong model recovery property. However, the exponential search space of the exponential mechanism poses a serious computational bottleneck. To overcome this challenge, we propose a Metropolis-Hastings algorithm for the sampling step and establish its polynomial mixing time to its stationary distribution in the problem parameters $n,p$, and $s$. Furthermore, we also establish approximate differential privacy for the final estimates of the Metropolis-Hastings random walk using its mixing property. Finally, we also perform some illustrative simulations that echo the theoretical findings of our main results.  ( 2 min )
    Learning Regularized Monotone Graphon Mean-Field Games. (arXiv:2310.08089v1 [cs.GT])
    This paper studies two fundamental problems in regularized Graphon Mean-Field Games (GMFGs). First, we establish the existence of a Nash Equilibrium (NE) of any $\lambda$-regularized GMFG (for $\lambda\geq 0$). This result relies on weaker conditions than those in previous works for analyzing both unregularized GMFGs ($\lambda=0$) and $\lambda$-regularized MFGs, which are special cases of GMFGs. Second, we propose provably efficient algorithms to learn the NE in weakly monotone GMFGs, motivated by Lasry and Lions [2007]. Previous literature either only analyzed continuous-time algorithms or required extra conditions to analyze discrete-time algorithms. In contrast, we design a discrete-time algorithm and derive its convergence rate solely under weakly monotone conditions. Furthermore, we develop and analyze the action-value function estimation procedure during the online learning process, which is absent from algorithms for monotone GMFGs. This serves as a sub-module in our optimization algorithm. The efficiency of the designed algorithm is corroborated by empirical evaluations.  ( 2 min )
    Efficient Integrators for Diffusion Generative Models. (arXiv:2310.07894v1 [cs.LG])
    Diffusion models suffer from slow sample generation at inference time. Therefore, developing a principled framework for fast deterministic/stochastic sampling for a broader class of diffusion models is a promising direction. We propose two complementary frameworks for accelerating sample generation in pre-trained models: Conjugate Integrators and Splitting Integrators. Conjugate integrators generalize DDIM, mapping the reverse diffusion dynamics to a more amenable space for sampling. In contrast, splitting-based integrators, commonly used in molecular dynamics, reduce the numerical simulation error by cleverly alternating between numerical updates involving the data and auxiliary variables. After extensively studying these methods empirically and theoretically, we present a hybrid method that leads to the best-reported performance for diffusion models in augmented spaces. Applied to Phase Space Langevin Diffusion [Pandey & Mandt, 2023] on CIFAR-10, our deterministic and stochastic samplers achieve FID scores of 2.11 and 2.36 in only 100 network function evaluations (NFE) as compared to 2.57 and 2.63 for the best-performing baselines, respectively. Our code and model checkpoints will be made publicly available at https://github.com/mandt-lab/PSLD.  ( 2 min )
    How Many Pretraining Tasks Are Needed for In-Context Learning of Linear Regression?. (arXiv:2310.08391v1 [stat.ML])
    Transformers pretrained on diverse tasks exhibit remarkable in-context learning (ICL) capabilities, enabling them to solve unseen tasks solely based on input contexts without adjusting model parameters. In this paper, we study ICL in one of its simplest setups: pretraining a linearly parameterized single-layer linear attention model for linear regression with a Gaussian prior. We establish a statistical task complexity bound for the attention model pretraining, showing that effective pretraining only requires a small number of independent tasks. Furthermore, we prove that the pretrained model closely matches the Bayes optimal algorithm, i.e., optimally tuned ridge regression, by achieving nearly Bayes optimal risk on unseen tasks under a fixed context length. These theoretical findings complement prior experimental research and shed light on the statistical foundations of ICL.  ( 2 min )
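    The link between ridge regression and the Bayes-optimal predictor under a Gaussian prior can be checked numerically. The sketch below assumes the textbook setting $w \sim N(0, \psi^2 I)$ with noise variance $\sigma^2$, in which the posterior mean equals ridge regression with penalty $\sigma^2/\psi^2$; the paper's exact setup may differ, and the names `ridge` and `posterior_mean` are introduced here for illustration only.

```python
import numpy as np


def ridge(X, y, reg):
    """Ridge estimate (X^T X + reg * I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + reg * np.eye(d), X.T @ y)


def posterior_mean(X, y, psi2, sigma2):
    """Posterior mean of w given y, via Gaussian conditioning:
    E[w | y] = Cov(w, y) Cov(y)^{-1} y."""
    n = X.shape[0]
    cov_y = psi2 * (X @ X.T) + sigma2 * np.eye(n)
    return psi2 * X.T @ np.linalg.solve(cov_y, y)


rng = np.random.default_rng(0)
n, d, psi2, sigma2 = 20, 5, 2.0, 0.5
X = rng.standard_normal((n, d))
w = np.sqrt(psi2) * rng.standard_normal(d)             # draw w from the prior
y = X @ w + np.sqrt(sigma2) * rng.standard_normal(n)   # noisy linear responses

w_ridge = ridge(X, y, reg=sigma2 / psi2)
w_bayes = posterior_mean(X, y, psi2, sigma2)
```

    The two estimates agree up to floating-point error, which is the sense in which "optimally tuned ridge regression" is Bayes optimal in this Gaussian setting.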
    Personalised dynamic super learning: an application in predicting hemodiafiltration's convection volumes. (arXiv:2310.08479v1 [stat.ME])
    Obtaining continuously updated predictions is a major challenge for personalised medicine. Leveraging combinations of parametric regressions and machine learning approaches, the personalised online super learner (POSL) can achieve such dynamic and personalised predictions. We adapt POSL to predict a repeated continuous outcome dynamically and propose a new way to validate such personalised or dynamic prediction models. We illustrate its performance by predicting the convection volume of patients undergoing hemodiafiltration. POSL outperformed its candidate learners with respect to median absolute error, calibration-in-the-large, discrimination, and net benefit. We finally discuss the choices and challenges underlying the use of POSL.  ( 2 min )
    L2P: Learning to Place for Estimating Heavy-Tailed Distributed Outcomes. (arXiv:1908.04628v3 [cs.LG] UPDATED)
    Many real-world prediction tasks have outcome variables that have characteristic heavy-tail distributions. Examples include copies of books sold, auction prices of art pieces, demand for commodities in warehouses, etc. By learning heavy-tailed distributions, "big and rare" instances (e.g., the best-sellers) will have accurate predictions. Most existing approaches are not dedicated to learning heavy-tailed distribution; thus, they heavily under-predict such instances. To tackle this problem, we introduce Learning to Place (L2P), which exploits the pairwise relationships between instances for learning. In its training phase, L2P learns a pairwise preference classifier: is instance A > instance B? In its placing phase, L2P obtains a prediction by placing the new instance among the known instances. Based on its placement, the new instance is then assigned a value for its outcome variable. Experiments on real data show that L2P outperforms competing approaches in terms of accuracy and ability to reproduce heavy-tailed outcome distribution. In addition, L2P provides an interpretable model by placing each predicted instance in relation to its comparable neighbors. Interpretable models are highly desirable when lives and treasure are at stake.  ( 3 min )
    Statistical Performance Guarantee for Selecting Those Predicted to Benefit Most from Treatment. (arXiv:2310.07973v1 [stat.ME])
    Across a wide array of disciplines, many researchers use machine learning (ML) algorithms to identify a subgroup of individuals, called exceptional responders, who are likely to be helped by a treatment the most. A common approach consists of two steps. One first estimates the conditional average treatment effect or its proxy using an ML algorithm. One then determines the cutoff of the resulting treatment prioritization score to select those predicted to benefit most from the treatment. Unfortunately, these estimated treatment prioritization scores are often biased and noisy. Furthermore, utilizing the same data to both choose a cutoff value and estimate the average treatment effect among the selected individuals suffers from a multiple testing problem. To address these challenges, we develop a uniform confidence band for experimentally evaluating the sorted average treatment effect (GATES) among the individuals whose treatment prioritization score is at least as high as any given quantile value, regardless of how the quantile is chosen. This provides a statistical guarantee that the GATES for the selected subgroup exceeds a certain threshold. The validity of the proposed methodology depends solely on randomization of treatment and random sampling of units without requiring modeling assumptions or resampling methods. This widens its applicability to a wide range of other causal quantities. A simulation study shows that the empirical coverage of the proposed uniform confidence bands is close to the nominal coverage when the sample is as small as 100. We analyze a clinical trial of late-stage prostate cancer and find a relatively large proportion of exceptional responders with a statistical performance guarantee.  ( 3 min )
    Towards a Unified Analysis of Kernel-based Methods Under Covariate Shift. (arXiv:2310.08237v1 [stat.ML])
    Covariate shift occurs prevalently in practice, where the input distributions of the source and target data are substantially different. Despite its practical importance in various learning problems, most of the existing methods only focus on some specific learning tasks and are not well validated theoretically and numerically. To tackle this problem, we propose a unified analysis of general nonparametric methods in a reproducing kernel Hilbert space (RKHS) under covariate shift. Our theoretical results are established for a general loss belonging to a rich loss function family, which includes many commonly used methods as special cases, such as mean regression, quantile regression, likelihood-based classification, and margin-based classification. Two types of covariate shift problems are the focus of this paper and the sharp convergence rates are established for a general loss function to provide a unified theoretical analysis, which concurs with the optimal results in literature where the squared loss is used. Extensive numerical studies on synthetic and real examples confirm our theoretical findings and further illustrate the effectiveness of our proposed method.  ( 2 min )
    Towards the Fundamental Limits of Knowledge Transfer over Finite Domains. (arXiv:2310.07838v1 [cs.LG])
    We characterize the statistical efficiency of knowledge transfer through $n$ samples from a teacher to a probabilistic student classifier with input space $\mathcal S$ over labels $\mathcal A$. We show that privileged information at three progressive levels accelerates the transfer. At the first level, only samples with hard labels are known, via which the maximum likelihood estimator attains the minimax rate $\sqrt{{|{\mathcal S}||{\mathcal A}|}/{n}}$. The second level has the teacher probabilities of sampled labels available in addition, which turns out to boost the convergence rate lower bound to ${{|{\mathcal S}||{\mathcal A}|}/{n}}$. However, under this second data acquisition protocol, minimizing a naive adaptation of the cross-entropy loss results in an asymptotically biased student. We overcome this limitation and achieve the fundamental limit by using a novel empirical variant of the squared error logit loss. The third level further equips the student with the soft labels (complete logits) on ${\mathcal A}$ given every sampled input, thereby provably enables the student to enjoy a rate ${|{\mathcal S}|}/{n}$ free of $|{\mathcal A}|$. We find any Kullback-Leibler divergence minimizer to be optimal in the last case. Numerical simulations distinguish the four learners and corroborate our theory.  ( 2 min )

  • Open

    Savage Dall-e 3 delivers "Average reddit post"
    submitted by /u/Zimmax [link] [comments]
    AI — weekly megathread!
    News provided by aibrews.com Researchers present LLark: A Multimodal Foundation Model for Music - an open-source instruction-tuned multimodal model for music understanding. LLark is trained entirely from open-source music data and models [Demo | Paper] Researchers released LLaVA-1.5. LLaVA (Large Language and Vision Assistant) is an open-source large multimodal model that combines a vision encoder and Vicuna for general-purpose visual and language understanding. LLaVA-1.5 achieved SoTA on 11 benchmarks, with just simple modifications to the original LLaVA and completed training in ~1 day on a single 8-A100 node [Demo | Paper | GitHub]. Voice AI platform ElevenLabs released AI Dubbing tool that enables users to automatically translate any audio in a video into a different language whil…
    The AI Boom Could Use a Shocking Amount of Electricity
    The rapid growth of artificial intelligence (AI) could lead to a significant increase in global electricity consumption, according to a peer-reviewed analysis published in Joule. The analysis estimates that if current trends continue, AI could drive the demand for electricity in data centers to consume at least 85.4 terawatt-hours annually, which is more than what many small countries use in a year. AI is energy-intensive, with both the training and inference phases requiring a significant amount of energy. The size of AI models, such as large language models, and the location of data centers also contribute to energy usage. Factors such as cooling requirements and the type of hardware used can impact energy consumption. Source : https://www.scientificamerican.com/article/the-ai-boom-could-use-a-shocking-amount-of-electricity/ submitted by /u/NuseAI [link] [comments]
    Lemur: Harmonizing Natural Language and Code for Language Agents
    Today's conversational bots like Claude and GPT can chat impressively but aren't great at complex planning or executing technical tasks. To overcome this, new research from HKU builds open-source AI agents that blend natural language and coding skills. They're called Lemur and Lemur-Chat. The researchers think achieving versatile real-world agents requires models that integrate both fluid natural language abilities and precise programming language control. Humans combine plain speech for higher-level goals with languages like Python when we need to plan intricately and execute exactly. AI needs both capacities too. But most existing models specialize in pure language or pure code. There's a separation that is limiting. The team created Lemur by pretraining the open-source Llama-2 on a massive mixed corpus with 10x more natural language than code. This improved its programming abilities while retaining conversational strength. Further instruction tuning optimized Lemur-Chat for following free-form directions in language. Experiments found Lemur surpassed specialized coding-only models like Codex in overall benchmarks. Lemur-Chat then exceeded Lemur by 15% after instruction tuning. More importantly, Lemur-Chat won 12/13 new "agent tests" designed to mimic real-world challenges needing both language and programming prowess. It beat alternatives at: Using tools like Python and Wikipedia to enhance reasoning Debugging code by leveraging error messages Improving the most from natural language feedback Exploring partially observable environments like cybersecurity and web browsing simulations. Lemur-Chat matched GPT-3.5 in many tests, closing the gap between commercial and open-source agents. TLDR: New open-source AI agents combine coding and language skills. Experiments show the combo unlocks more performance across technical challenges. Full summary is here. Paper is here. submitted by /u/Successful-Western27 [link] [comments]
    Henry Kissinger: The Path to AI Arms Control
    submitted by /u/ForeignAffairsMag [link] [comments]
    A 21-year-old won $40,000 for using AI to read the first word on a 2,000-year-old papyrus scroll buried by Mount Vesuvius
    submitted by /u/thisisinsider [link] [comments]
    "Special Announcement: John Carmack & Rich Sutton partner to accelerate development of AGI" | "Carmack and Sutton are deeply focused on developing a genuine AI prototype by 2030, including establishing, advancing, and documenting AGI signs of life"
    submitted by /u/Tao_Dragon [link] [comments]
    Dumbing down or wising up: how will generative AI change the way we think?
    submitted by /u/Jariiari7 [link] [comments]
    One-Minute Daily AI News 10/13/2023
    In a recent article published in the journal Nature, researchers developed AI Tool EVEscape, a tool to forecast which severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) strains have the highest potential to escape host immunity.[1] Microsoft seems to be working on the possible development of an artificial intelligence (AI) system that can understand and resolve customer support requests using natural language processing.[2] Google’s Search Generative Experience (SGE) will let you create images right from a text prompt starting Thursday.[3] The Biden administration is considering closing a loophole that gives Chinese companies access to American artificial intelligence (AI) chips through units located overseas, according to four people familiar with the matter.[4] Sources: [1] https://www.news-medical.net/news/20231012/EVScape-New-tool-to-forecast-which-SARS-CoV-2-variants-could-dodge-our-immunity.aspx [2] https://winbuzzer.com/2023/10/11/microsoft-gears-up-for-a-revolutionary-natural-language-customer-support-ai-xcxwbn/ [3] https://www.theverge.com/2023/10/12/23913337/google-ai-powered-search-sge-images-written-drafts [4] https://www.reuters.com/technology/biden-eyes-adding-ai-chip-curbs-chinese-companies-abroad-2023-10-13/ submitted by /u/Excellent-Target-847 [link] [comments]
    I’ve created an audiobook generator. Anyone got any books to test it on? Each character is given a different voice.
    Also, if anyone has someone in mind who should be included as a voice actor, it can also clone voices. Idk, I need to make sure it works for a wide variety of books, as long as they don’t use ‘ for quotes, because the computer gets confused when “I’ve” and such use the same symbol. submitted by /u/Impossible_Belt_7757 [link] [comments]
    Check out the latest episode of my history podcast on the future of A.I.!
    submitted by /u/ErikSlader713 [link] [comments]
    Drew a picture in paint, threw it in hotpot, and it came out a stylish, halloweenish picture. Damn this stuff is amazing.
    submitted by /u/kipaxbooks [link] [comments]
  • Open

    [P] App for iOS and M1 macOS for image bounding box annotation
    ClassifyML is an application for creating specialised image datasets for use with an ML training algorithm. Simply import your chosen images into the app via file manager, drag'n'drop or the on device camera and create your bounding boxes and then export your images and JSON into a structured folder. LINK: https://apps.apple.com/app/classify-ml/id6461013113 https://preview.redd.it/dicsq9d3k1ub1.png?width=313&format=png&auto=webp&s=7976a61f599c658d948dec12db0b8ec93274ad93 https://preview.redd.it/3tswxdd3k1ub1.png?width=313&format=png&auto=webp&s=56ca30546984402f4dbba628b73732918e921758 https://preview.redd.it/y0xelmz3k1ub1.png?width=313&format=png&auto=webp&s=a755ea61bc247c6aacb61a31c700e4e80a1ed69f submitted by /u/LiamRogers99 [link] [comments]  ( 9 min )
    [D] What are the best resources for learning reinforcement learning?
    Recently I came across Open AI's Spinning Up Project, which seems to be well structured, but quite introductory. What are some resources you use for learning RL? submitted by /u/OwnAd9305 [link] [comments]  ( 9 min )
    [D] LLM for entity/scene recognition in a book?
    Hello, I'm looking for an open source LLM that can extract all the characters from an inputted book, and isolate passages with descriptive writing that involves imagery. Can anyone suggest me something? Thanks! submitted by /u/slomorosh [link] [comments]  ( 9 min )
    [P] Deploy and Run LLMs at the Edge: Use Code Llama to Generate a Dashboard in a Network Restricted Environment
    In this blog, we explore different definitions of “the edge,” and understand the factors driving AI/ML to the edge. We examine why the trends of LLMs and edge computing are intersecting now, and how teams can take advantage of their combined power today. We also demonstrate how LLMs can be used in an edge environment to generate insights for a real-world use case today. Consider a geologist working in a remote oil field who is responsible for building and analyzing 3D models of oil fields to determine production capacity and the impact on profitability. In this demo, we walk through how Code Llama, Chassisml.io, and Modzy could be used to build a dashboard that geologists could use to analyze well data in real-time in a remote, network restricted environment, allowing for LLM insights generated at the edge. Learn more: https://www.modzy.com/modzy-blog/deploy-and-run-llms-at-the-edge submitted by /u/modzykirsten [link] [comments]  ( 9 min )
    [D] ICLR submissions are out. Discussion thread
    https://openreview.net/group?id=ICLR.cc/2024/Conference submitted by /u/_puhsu [link] [comments]  ( 8 min )
    [D] Vscode issue
    I am running AutoTokenizer from transformers on vscode. The vscode crashes showing error and not responding. I don't understand what's wrong. submitted by /u/ArtichokeOne5897 [link] [comments]  ( 8 min )
    "[P]" Utilizing Machine Learning Techniques for Document Digitalization Project
    Hey Guys,
    I am currently spearheading a project for a client in the insurance industry, with a primary objective being the digitalization of thousands of hardcopy contracts. The ultimate goal is to automatically extract particular information from these newly digital documents, namely "date", "insurance premium", "insurance type", and "contractor's name". However, I anticipate a level of variability in terms of exact terminology used, particularly with regards to "insurance premium" and "insurance type". (There is no handwritten text)
    I am keen on sharing the methodology I intend to apply for this project and invite your invaluable feedback and suggestions:
    - Firstly, I'll execute the scanning/digitalization of the documents manually.
    - Post this, I plan to utilize Tesseract in combination with Python for the extraction of text from the preprocessed images.
    - I am considering using libraries such as NLTK or spaCy to preprocess this text (this will involve steps like lower casing, removing punctuations, etc.)
    - Finally, I plan to train a custom model for Named Entity Recognition (NER), to accommodate the potential semantic variations in entity labeling which are specific to entities like "insurance premium" and "insurance type".
    I would be immensely grateful if I could gain your insights on the above-proposed pipeline - Are there any glaring pitfalls I need to avoid or perhaps some improvements that I could incorporate? Your expert advice can certainly help ensure the success of this venture.
    Many thanks in anticipation for your time and valuable inputs! submitted by /u/Background_Thanks604 [link] [comments]  ( 9 min )
    [News] AI & ML conference in San Francisco [Special discount code for this subreddit]
    I work for the database company SingleStore, and we are hosting an AI & ML conference in San Francisco on 17th of October, 2023. It is an in-person conference with an amazing speaker line-up, including Harrison Chase, co-founder and CEO of LangChain, and many more. We will have hands-on workshops, swag giveaways and much more. I don't know if it makes sense to share this but I believe it might help some of you near San Francisco to go and meet the industry leaders and network with other data engineering folks. Use my discount coupon code 'PAVAN100OFF' to avail 100% off on the ticket price. (the original ticket price is $199) Get your tickets now! submitted by /u/PavanBelagatti [link] [comments]  ( 9 min )
    Using RAG on CoreML version of Llama2 [P]
    Has anyone ever attempted this or finetuning before on the CoreML version? I’m currently trying to and I’m not even sure where to start tbh. CoreML version of Llama 2: https://huggingface.co/coreml-projects/Llama-2-7b-chat-coreml submitted by /u/Inside-Aromatic [link] [comments]  ( 9 min )
    [D] How is L1 Regularization able to drive a coefficient to zero?
    Hi all, I’m studying the concepts of machine learning. However, I am stuck because I still don’t see how introducing a penalty using lasso regression can drive some parameter coefficients to zero. When doing the calculations, I only get the final value (ordinary least squares + penalty) and don’t directly see a coefficient value being reduced. I've looked at many materials and resources trying to explain this, but I still can't see how it's done. I think the important thing for me is seeing it going to zero or, at the very least, seeing it during calculation. Is there anyone that can help explain this better? Or, If you know of a formula that I can derive that, during the derivation process, shows a coefficient being reduced or set to zero, that would also help. Also, any good resources on the topic would be appreciated. Edit: This post should have been posted in r/learnmachinelearning here is a link to the same post in that subreddit submitted by /u/thismymind [link] [comments]  ( 9 min )
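    One place where the zeroing is visible during the calculation is the one-coefficient case. As a sketch under a simplifying assumption (a single standardized feature, so the objective reduces to 0.5 * (b - b_ols)**2 + lam * |b|), the minimiser is the soft-thresholding rule: at b = 0 the subgradient of lam * |b| covers the whole interval [-lam, lam], so whenever |b_ols| <= lam the optimum is exactly zero. The ridge function is included only for contrast; the function names and the value 1.2 are illustrative.

```python
def soft_threshold(b_ols: float, lam: float) -> float:
    """Minimiser of 0.5 * (b - b_ols)**2 + lam * abs(b) (one-coefficient lasso)."""
    if b_ols > lam:
        return b_ols - lam   # shrink towards zero from the positive side
    if b_ols < -lam:
        return b_ols + lam   # shrink towards zero from the negative side
    return 0.0               # |b_ols| <= lam: the penalty wins, coefficient is exactly 0


def ridge_shrink(b_ols: float, lam: float) -> float:
    """Minimiser of 0.5 * (b - b_ols)**2 + 0.5 * lam * b**2 (one-coefficient ridge)."""
    return b_ols / (1.0 + lam)  # rescaled, but never exactly 0 for b_ols != 0


# Illustrative OLS estimate of 1.2 under an increasing penalty.
for lam in (0.0, 0.5, 1.0, 2.0):
    print(lam, soft_threshold(1.2, lam), ridge_shrink(1.2, lam))
```

    With several correlated features there is no closed form, but coordinate descent applies the same thresholding step to one coefficient at a time, which is how individual coefficients end up at exactly zero.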
    [D] How do you pre-pay OpenAI compute credit with university funds?
    I am an academic and I have some funding. However, I cannot just plug in my lab card with a recurrent payment, procedures don't allow it. Is there a way to "top up" some compute credits on the OpenAI accounts ? Is anyone having the same problem ? Thanks. submitted by /u/Jean-Porte [link] [comments]  ( 9 min )
    [R] Seeking Guidance on Efficiently Classifying and Cleansing Automotive Data with Python
    Hi, we are working on a project that involves dealing with messy automotive data, and are looking for guidance on possible approaches and tools. We aim to map messy supplier data of car makes/models to standardized values from our approved list. This requires handling various challenges like typos, varied specificity, and sometimes research-based mapping (e.g., using engine size and production year to ascertain a chassis code). eg: If a supplier provides 'BNW 316i saloon 1990-1994', (typo intentional) we would like to match it to our standardized value of 'BMW 3 Series (E36)'. Our old approach has been a combination of utilizing fuzzy matching for typos/basic matching and time consuming manual processing and verification. We have recently experimented with using GPT for providing guess…  ( 10 min )
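    The combination of typo-tolerant matching and rule-based mapping described above might be sketched as follows, using only the Python standard library. Everything here is an assumption made for illustration: the approved makes, the `CHASSIS_RULES` table (seeded with the single example from the post) and the helper names are hypothetical, and a real mapping would need a far richer rule set.

```python
import difflib
import re

# Hypothetical approved makes (illustrative only).
APPROVED_MAKES = ["BMW", "Audi", "Ford"]

# Hypothetical rules: (make, model token) -> [(first_year, last_year, standard value)].
CHASSIS_RULES = {
    ("BMW", "316i"): [(1990, 1994, "BMW 3 Series (E36)")],
}


def correct_make(token):
    """Map a possibly misspelled make to an approved make, or None."""
    lowered = {m.lower(): m for m in APPROVED_MAKES}
    hits = difflib.get_close_matches(token.lower(), list(lowered), n=1, cutoff=0.6)
    return lowered[hits[0]] if hits else None


def standardise(raw):
    """Return the standardized value for a supplier string, or None if unsure."""
    tokens = raw.split()
    years = [int(y) for y in re.findall(r"\b(?:19|20)\d{2}\b", raw)]
    if len(tokens) < 2 or not years:
        return None
    make = correct_make(tokens[0])
    if make is None:
        return None
    # Rule-based step: production years must fall inside a known range.
    for first, last, value in CHASSIS_RULES.get((make, tokens[1].lower()), []):
        if first <= min(years) and max(years) <= last:
            return value
    return None  # leave for manual review rather than guess


print(standardise("BNW 316i saloon 1990-1994"))
```

    Returning None for anything the rules do not cover keeps uncertain rows in the manual-review queue instead of silently mapping them.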
    [R] Lemur: Harmonizing Natural Language and Code for Language Agents
    Today's conversational bots like Claude and GPT can chat impressively but aren't great at complex planning or executing technical tasks. To overcome this, new research from HKU builds open-source AI agents that blend natural language and coding skills. They're called Lemur and Lemur-Chat. The researchers think achieving versatile real-world agents requires models that integrate both fluid natural language abilities and precise programming language control. Humans combine plain speech for higher-level goals with languages like Python when we need to plan intricately and execute exactly. AI needs both capacities too. But most existing models specialize in pure language or pure code. There's a separation that is limiting. The team created Lemur by pretraining the open-source Llama-2 on a massive mixed corpus with 10x more natural language than code. This improved its programming abilities while retaining conversational strength. Further instruction tuning optimized Lemur-Chat for following free-form directions in language. Experiments found Lemur surpassed specialized coding-only models like Codex in overall benchmarks. Lemur-Chat then exceeded Lemur by 15% after instruction tuning. More importantly, Lemur-Chat won 12/13 new "agent tests" designed to mimic real-world challenges needing both language and programming prowess. It beat alternatives at: Using tools like Python and Wikipedia to enhance reasoning Debugging code by leveraging error messages Improving the most from natural language feedback Exploring partially observable environments like cybersecurity and web browsing simulations. Lemur-Chat matched GPT-3.5 in many tests, closing the gap between commercial and open-source agents. TLDR: New open-source AI agents combine coding and language skills. Experiments show the combo unlocks more performance across technical challenges. Full summary is here. Paper is here. submitted by /u/Successful-Western27 [link] [comments]  ( 9 min )
    [P] Introducing PPO and Rainbow DQN to our super fast evolutionary HPO reinforcement learning framework
    Hi, we've just released a new version of AgileRL, our evolutionary hyperparameter optimisation framework built for RL that is 10x faster than SOTA. We've introduced PPO, Rainbow DQN, some sophisticated replay buffers, and also collaborated with the Farama Foundation to create some tutorials (more on the way). Please check it out and take it for a spin. We're also looking for contributors so get in touch if you would like to be involved! https://github.com/AgileRL/AgileRL submitted by /u/nicku_a [link] [comments]  ( 9 min )
    [P] Free open-source ML observability course: starts October 16 🚀
    Hi everyone, I’m one of the creators of Evidently, an open-source (Apache 2.0) tool for production ML monitoring. We’ve just launched a free open course on ML observability that I wanted to share with the community. The course covers: 📚 Key concepts of ML monitoring and observability (data drift, data and model quality metrics, etc.) 🔡 Monitoring unstructured data (embeddings, texts, LLMs, etc.) 🛠 Different deployment architectures (batch ML monitoring jobs, near real-time ML monitoring, etc.) The course is free and open. All materials are public, with no sign-up required. You’ll work with open-source tools like Evidently, MLflow, Airflow, and Grafana. We’ve already published the first 12 videos with notes and code examples. We’ll add new lessons and deployment blueprints over the following weeks. The official course start date is October 16, 2023. You can also learn at your own pace. Course info and notes: https://learn.evidentlyai.com/ [Background] We’ve been working on Evidently since late 2020 and have spoken to 100s of data scientists, ML engineers, and ML platform teams in different industries. In this course, we tried to sum up answers to the frequent questions on the topic. It starts with high-level theoretical modules and goes to complete deployment blueprints. It is approachable for different levels of knowledge, and you can pick only the modules you are interested in. Looking forward to meeting you at the course! submitted by /u/mllena [link] [comments]  ( 9 min )
    Can I use ArcPro to do machine learning on point (numeric) data? [D] [R]
    I am trying to do machine learning in ArcPro, and I want to understand the relationship between x, y, numeric variable 1, numeric variable 2, and one nominal variable (classified; i.e. can be one of four values). I'd like to be able to predict numeric variable 1 based on everything else. Can ArcPro accommodate machine learning for anything other than raster type data. That is, can it be used to do machine learning on point (numeric) data? Thanks! submitted by /u/arcgis_123 [link] [comments]  ( 9 min )
    [R] TimeGPT : The first Generative Pretrained Transformer for Time-Series Forecasting
    In 2023, Transformers made significant breakthroughs in time-series forecasting. For example, earlier this year, Zalando proved that scaling laws apply in time-series as well, provided you have large datasets. (And yes, the 100,000 time series of M4 are not enough: the smallest 7B Llama was trained on 1 trillion tokens!) Nixtla curated a 100B dataset of time-series and trained TimeGPT, the first foundation model on time-series. The results are unlike anything we have seen so far. I published the results in my latest article. I hope the research will be insightful for people who work on time-series projects. Link: https://aihorizonforecast.substack.com/p/timegpt-the-first-foundation-model Note: If you know any other good resources on very large benchmarks for time series models, feel free to add them below. submitted by /u/nkafr [link] [comments]
    [R] Pointers to (deep) latent variable models that admit analytical approximations
    Hi everyone. I am aware that there is a plethora of deep generative models out there (e.g. variational autoencoders (VAE), GANs) that can model high-dimensional data as the images of latent variables under a non-linear mapping (typically a neural network). In more traditional methods such as probabilistic PCA, the latent variables can be marginalised analytically. In Bayesian PCA (BPCA), we can additionally integrate out the linear mapping, from the latent space to the observation space, by adopting the variational lower bound that leads to closed form updates of the parameters. The Gaussian Process Latent Variable Model (GPLVM) adopts a non-linear probabilistic mapping (a Gaussian process) that can be marginalised. These two models enjoy to a certain degree analytical solutions concerning the inference of the latent variables and the mapping. I have been wondering whether there is any research into more "complex" models (perhaps I should call them deep) that are capable of modelling more complex data distributions than the GPLVM and BPCA, but retain analytical solutions when inferring the posterior of the latent variables (like BPCA) or the mapping (like GPLVM)? What I like about the GPLVM and BPCA is that they possess an objective function (i.e. ELBO) that can be analytically optimised, as opposed to the intractable objective of VAEs that necessitates Monte-Carlo averages and stochastic gradient. Could somebody please point me to such examples of more complex generative models that admit analytical inference for working out the posterior of the latent variables or the mapping? ----- This has also been posted on stack exchange: https://ai.stackexchange.com/q/42418/61537 submitted by /u/ngiann [link] [comments]  ( 9 min )
    [D] I love teaching! But I don't have enough publications for it, what should I do?
    Do I love teaching? Oh, absolutely, YES a big YES! My time as a TA for countless semesters has been amazing. Staying after hours, spending long evenings and early mornings, to make each of my students find ease in debugging both easy-peasy and mind-boggling programs – it’s been a joy, truly. Watching those fresh faces, whom I introduced to Python in their first year (intro to programming lab), now immerse themselves into my computer vision labs, exploring computer vision and deep learning in their third/fourth year – it’s incredibly rewarding! And yeah my students kind of like me! After each semester I get tons of emails thanking me and my TAship review is always good. But, ugh, do I have enough publications to become faculty? A big fat NO! My efforts have been relentless, and everyone in my department would nod in agreement. But luck and reviewers? Not my best pals, apparently. So yeah, I don’t have a stack of 8 top-tier papers. I’ve managed to scrape together 3, and a few second tiers. My citation count is not that bad, somewhere between 200 and 300-ish. Now, what’s next for me? Dive into the industry? Become a high school teacher? Or perhaps, do a postdoc journey, fingers crossed for a sprinkle more luck and a few more papers? Edit: This doesn't mean I don't like research, I actually love it too. I have done quite a few internships in quite big companies, most of the time they extend my internship and I even got a publication out of one in 5 months. But I just like to teach a lot! Strangely, I get social anxiety everywhere other than my classrooms/labs. submitted by /u/LongjumpingSchool646 [link] [comments]  ( 9 min )
    [D] You don't need a Vector Database you just need a database
    I'm seeing some architectures come out from the LLM world that probably wouldn't survive the trip to production. If you choose a vector database how will you handle your other database needs? Then you'll need 2 databases. https://bionic-gpt.com/blog/you-dont-need-a-vector-database/ submitted by /u/purton_i [link] [comments]  ( 9 min )
    [D] Why is back-propagation intractable for the MoCo key encoder?
    The original MoCo paper says: "Using a queue can make the dictionary large, but it also makes it intractable to update the key encoder by back-propagation (the gradient should propagate to all samples in the queue)." At first I thought the main reason that bp cannot be applied to the key encoder is that the queue operation is not differentiable. But that does not seem to be true: you can compute the gradient for all samples in the queue, and then bp should work properly. See the code at the bottom. So WHAT IS THE REAL REASON THAT THE BP IS INTRACTABLE FOR THE KEY ENCODER? In my opinion, it may be because of the large size of the queue (dictionary), which makes memory usage explode.

        import torch
        import torch.nn as nn

        q = nn.Linear(768, 128)   # query encoder
        k = nn.Linear(768, 128)   # key encoder
        bs = 64                   # batch size
        ks = 4095                 # queue (dictionary) size
        model = nn.ModuleList([q, k])
        x = torch.randn(bs, 768)
        optim = torch.optim.SGD(model.parameters(), lr=0.01)
        criterion = nn.CrossEntropyLoss()  # renamed so the loss value below does not shadow it

        def forward(x):
            xq = q(x)
            xk = k(x + 0.1)
            # Random stand-in for the queue; it is not produced by k here,
            # so no gradient flows through these entries.
            que = torch.rand(ks, 128)
            pos = torch.einsum("nc,nc->n", xq, xk)    # positive logits, shape (bs,)
            neg = torch.einsum("nc,kc->nk", xq, que)  # negative logits, shape (bs, ks)
            out = torch.cat([pos.unsqueeze(-1), neg], dim=1)
            t = torch.zeros(out.shape[0], dtype=torch.long)  # the positive is class 0
            return criterion(out, t)

        loss = forward(x)
        loss.backward()
        optim.step()

    submitted by /u/whishtLF [link] [comments]  ( 9 min )
    [D] Advisor rejects every idea I propose.
    A senior phd student at a moderately famous university. I have a reasonable number of accepted papers as first author in tier-1 conferences. I was thinking of going into academia, so recently I started proposing many ideas to my advisor so that I can mentor some junior students. However my advisor is rejecting every idea I suggest saying it won’t work. I’m feeling very dejected and I feel like I should give up going into academia. I don’t know what I’m expecting from here. Is your advisor like this too? submitted by /u/mildlyphd [link] [comments]  ( 9 min )
  • Open

    Batch calibration: Rethinking calibration for in-context learning and prompt engineering
    Posted by Han Zhou, Student Researcher, and Subhrajit Roy, Senior Research Scientist, Google Research Prompting large language models (LLMs) has become an efficient learning paradigm for adapting LLMs to a new task by conditioning on human-designed instructions. The remarkable in-context learning (ICL) ability of LLMs also leads to efficient few-shot learners that can generalize from few-shot input-label pairs. However, the predictions of LLMs are highly sensitive and even biased to the choice of templates, label spaces (such as yes/no, true/false, correct/incorrect), and demonstration examples, resulting in unexpected performance degradation and barriers for pursuing robust LLM applications. To address this problem, calibration methods have been developed to mitigate the effects of t…  ( 93 min )
  • Open

    Significance of AI in the development of software products
    Artificial Intelligence (AI) is emerging as a formidable force, revolutionizing how we conceive, create, and deliver software solutions. As technology advances at an unprecedented pace, the role of AI in this domain has become increasingly significant. It’s no longer just a buzzword; it’s a fundamental tool that promises to reshape the entire software development process.… The post appeared first on Data Science Central.  ( 19 min )
    Future of AI and data science – How to secure a bright career
    Companies, more often, pay attention to automation and innovation over proficiency and productivity. However, firms can maintain a balance between both due to the extensive usage of AI and data science programs. Here are the stats that show the impact of AI and data science in diverse sectors: Applications of AI and data science have… The post appeared first on Data Science Central.  ( 21 min )
  • Open

    A question
    What are the ways to create plasticity in a neural network without using weights, biases, and activation functions? submitted by /u/Sith_vader3 [link] [comments]  ( 8 min )
    Neural Networks project
    Hi! My group (4 people) has chosen to make an application that translates ancient stone inscriptions into modern languages as our university project. We can use external libraries to process the images that we are going to translate, but as we understand it, we have to build the neural network ourselves from scratch. My questions are: 1) Is this possible to do within 10 months? 2) If so, how would you approach it? submitted by /u/sakith123 [link] [comments]
  • Open

    From Skylines to Streetscapes: How SHoP Architects Brings Innovative Designs to Life
    At SHoP Architects, a New York City-based architectural firm, Mengyi Fan and her team aim to inspire industry professionals to create visual masterpieces by incorporating emerging technologies. Fan, the director of visualization at SHoP, has expertise that spans the fields of architectural visualization and design. She takes a definitive, novel and enduring approach to designing…  ( 6 min )
  • Open

    Introducing PPO and Rainbow DQN to our super fast evolutionary HPO reinforcement learning framework
    Hi, we've just released a new version of AgileRL, our evolutionary hyperparameter optimisation framework built for RL that is 10x faster than SOTA. We've introduced PPO, Rainbow DQN, some sophisticated replay buffers, and also collaborated with the Farama Foundation to create some tutorials (more on the way). Please check it out and take it for a spin. We're also looking for contributors so get in touch if you would like to be involved! https://github.com/AgileRL/AgileRL submitted by /u/nicku_a [link] [comments]
    Masking state transitions in policy updates for invalid actions?
    I am currently dealing with an environment that, most of the time (90% of all state transitions), clips the action selected by the agent, sometimes to the point where the selected action is completely ignored. This causes a lot of problems. For example, the entropy bonus does not work: the agent learns to select arbitrary actions when they don't matter anyway, but selects the same action (low entropy) when actions have an effect. Using the PPO algorithm, I was thinking of masking the state transitions in the policy updates according to how much the action was clipped in the environment. And I thought V(s) should not be masked, because it can still learn from the state transitions even if the action was effectively ignored by the environment. submitted by /u/flxh13 [link] [comments]
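    The masking idea in this post can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a tested training recipe: `masked_ppo_policy_loss`, `value_loss`, and the per-transition `mask` weights (0 where the environment ignored the action, 1 where it was applied unchanged) are hypothetical names introduced here. The clipped surrogate is the standard PPO objective; only the policy term is weighted by the mask, while the value term uses every transition.

```python
import math


def masked_ppo_policy_loss(new_logp, old_logp, advantages, mask, clip_eps=0.2):
    """Clipped PPO surrogate loss, averaged over the mask weights.

    mask[i] is a hypothetical weight in [0, 1]: 0 if the environment ignored
    the agent's action on transition i, 1 if the action was applied as chosen.
    """
    total, weight = 0.0, 0.0
    for lp_new, lp_old, adv, m in zip(new_logp, old_logp, advantages, mask):
        ratio = math.exp(lp_new - lp_old)
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        surrogate = min(ratio * adv, clipped * adv)
        total += -m * surrogate  # masked transitions contribute nothing
        weight += m
    return total / max(weight, 1e-8)  # avoid division by zero if all are masked


def value_loss(values, returns):
    """Plain mean squared error; deliberately not masked."""
    return sum((v - r) ** 2 for v, r in zip(values, returns)) / len(values)
```

    Normalizing by the sum of the mask, rather than by the batch size, keeps the policy gradient scale roughly stable when most transitions are masked; whether that is the right choice for a 90%-clipped environment is an open design question.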
    A question about deterministic action selection at evaluation time
    I'm training some agents using fairly vanilla PPO on a hand-made environment. These agents learn to perform the task pretty well, but while I was examining their action probabilities during an evaluation episode, I had the idea to turn off deterministic action selection. To my surprise, allowing probabilistic action selection (as opposed to argmax action selection) actually improved performance in some cases. I had always thought that deterministic actions during evaluation was fairly standard, but now am thinking that maybe I missed something and that there are cases where you wouldn't want determinism? My question is: how common is it actually to use deterministic actions vs. probabilistic ones at evaluation time, and does anyone know of studies/papers/examples where the authors found probabilistic evaluation to outperform determinism? submitted by /u/Impallion [link] [comments]
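    For readers unfamiliar with the two evaluation modes being compared, here is a minimal sketch. The function name `select_action` and the example probabilities are made up for illustration and do not come from the post or from any particular RL library.

```python
import random


def select_action(probs, deterministic, rng=random):
    """Pick an action index from a list of action probabilities.

    deterministic=True  -> argmax (always the most probable action)
    deterministic=False -> sample in proportion to the probabilities
    """
    if deterministic:
        return max(range(len(probs)), key=lambda a: probs[a])
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# With a near-tie such as [0.51, 0.49], argmax always returns action 0,
# while sampling still picks action 1 almost half of the time.
greedy = select_action([0.51, 0.49], deterministic=True)
sampled = select_action([0.51, 0.49], deterministic=False)
```

    One situation where the two modes can differ noticeably is when the policy was trained to rely on its own randomness, for example to break out of states where the greedy action loops.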
    "A Simple Open-Loop Baseline for Reinforcement Learning Locomotion Tasks" Raffin et al. 2023
    submitted by /u/atooo57 [link] [comments]
    Looking for some advice regarding universal multi-head outputs
    Hey, so I am working on a reinforcement learning package in C# (currently under heavy development): https://github.com/asieradzk/RL_Matrix My goal is to create something superior to Unity's ML Agents for Godot, to democratize access to reinforcement learning for people (without having them know what a tensor is). So far I've added some barebones DQN and PPO (that only output a single discrete action) as a proof of concept to test my code architecture. So I am going through the daunting task of building a universal workflow for setting up environments, for observations of any shape and any number of actions, both discrete and continuous. As I am finishing my multi-head multi-action output I've come to realise that there are many possible architectures in which I could set up multi-head outputs, for instanc…
    Next state in turn based game
    To my knowledge, when using an algorithm from the Q-learning family, we must know the next state, as well as the action space that comes with that observation, in order to evaluate the value of the next state with the target network. But I have a problem when trying to define this next state in a turn-based game in which the agent has to make a certain number of actions and then wait for the opponent to take some actions before it can interact with the environment again. Take Hearthstone as an example: each player has to wait for the other to play a number of cards before they can take any action. Currently, I have two options for this: - Use the environment state right after the agent's turn has ended, which lacks an action space. - Use the environment state just before the agent's next turn begins, which has all the actions it can choose from, but this makes the outcome of the agent's last action very noisy. That state could be a good one if the opponent plays badly, or the opponent could be very good and make our last decision look like a very bad choice. Thanks in advance for any suggestions. If my problem is a common task that others have already solved many times, I would be very thankful for the keyword. submitted by /u/No-Concentrate-6037 [link] [comments]
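    The second option in this post (next state = the state at the start of the agent's next decision) can be sketched as a small recorder that holds the last action back until the agent is asked to act again. This is only an illustration under assumptions: the class `TurnBasedBuffer` and its method names are hypothetical, and treating the opponent as part of the environment is a modelling choice, not a claim about the best approach.

```python
class TurnBasedBuffer:
    """Collects (state, action, reward, next_state, next_legal_actions, done).

    The transition for an action is completed only when the agent is next
    asked to act (or the game ends), so next_state always has an action space.
    """

    def __init__(self):
        self.transitions = []
        self._pending = None  # (state, action, reward accumulated so far)

    def agent_acted(self, state, action, reward):
        """Record the agent's action; its next state is not known yet."""
        self._pending = (state, action, reward)

    def opponent_acted(self, reward):
        """Fold rewards from the opponent's turn into the pending transition."""
        if self._pending is not None:
            state, action, so_far = self._pending
            self._pending = (state, action, so_far + reward)

    def agent_can_act(self, state, legal_actions, done=False):
        """Call whenever the agent is asked to act again, or at game end."""
        if self._pending is not None:
            prev_state, action, reward = self._pending
            self.transitions.append(
                (prev_state, action, reward, state, tuple(legal_actions), done)
            )
            self._pending = None
```

    Within a single turn that contains several agent actions, `agent_can_act` is simply called again right after `agent_acted`, so those transitions complete immediately; only the last action of a turn waits for the opponent. The noise the poster worries about then shows up as ordinary environment stochasticity.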
    "Small batch deep reinforcement learning", Obando-Ceron et al 2023 {DM} (value-based agents explore & regularize better with small n)
    submitted by /u/gwern [link] [comments]

  • Open

    How are memories stored in neural networks? | The Hopfield Network #SoME2
    submitted by /u/keghn [link] [comments]
    A question
    How does a neural network process inputs that are the same but are presented differently to the model? submitted by /u/Sith_vader3 [link] [comments]
    I don't know much about NNs. Is this correct?
    I gave ChatGPT Vision an illustration of a neural network from The Principles of Deep Learning Theory. I want to know how correct its response is. Here is the response: https://preview.redd.it/inqe5xukxptb1.png?width=453&format=png&auto=webp&s=6e1079baeae8235b0e03a677e4006d1077af36a8 submitted by /u/YeshwanthRam [link] [comments]
  • Open

    Who Will Benefit from AI?
    Artificial intelligence (AI) can provide "machine usefulness" for human workers, augmenting their jobs rather than replacing them. However, there is a concern that AI could lead to job displacement and reinforce economic inequality. MIT economist Daron Acemoglu emphasizes the importance of making AI more useful to humans and ensuring that the economic benefits are shared widely. He suggests that innovations that augment workers' tasks can lead to prosperity for the workforce. Acemoglu also highlights the need for worker power and the careful implementation of technology to achieve shared prosperity and productivity gains. Source : https://idss.mit.edu/news/who-will-benefit-from-ai/ submitted by /u/NuseAI [link] [comments]
    What's the most advanced free chatbot available?
    I just need three things from it: It must be knowledgeable about things such as physics, math, history, books, geography, etc. It must also be original, with a high level of SEO and AI detection score. It must be available in Italy. The last part is essential. Claude 2 is very famous, but with SMS verification from the USA (which I don't have, and I don't want to give credit card info or pay to get it) it's almost impossible to use, even with a VPN. submitted by /u/luigirovatti1 [link] [comments]
    10 Powerful ChatGPT Hacks for SEO
    submitted by /u/Senior_tasteey [link] [comments]
    ChatGPT's Global Peace Plan
    Creating true, enduring, lasting peace on Earth is an ambitious and complex endeavor that requires multifaceted approaches. Here’s a bold, outside-the-box plan that may surprise you: Step 1: Establish a Global Consciousness: Educational Overhaul: Revamp global educational systems to foster empathy, understanding, and appreciation for diverse cultures, religions, and viewpoints from a young age. Step 2: Eradicate Poverty and Inequality: Universal Basic Assets (UBA): Implement a Universal Basic Assets program, where every person on Earth is granted a share of global resources. Step 3: Create a Single Global Governance Entity: World Federation: Establish a democratically elected World Federation that respects regional autonomy but has overriding authority on global issues like…
    When your AI says she loves you
    submitted by /u/thisisinsider [link] [comments]
    Anyone ever thought about training a video generating model, but backwards?
    Just had a random idea: What if you train a video generating AI, but feed it videos that are reversed? You could show it an image of a crashed car, and it would generate a video of the crash. Show it a broken vase, it would "repair" it. It could one day become like the "reconstruct crime scene" in Detroit: Become Human. What are your thoughts about this? submitted by /u/FluffyIllustrator805 [link] [comments]
    AI and science: what 1,600 researchers think
    A Nature survey of over 1,600 researchers reveals that AI tools are becoming increasingly common in science and are expected to be 'very important' or 'essential' in the next decade. Scientists express concerns about how AI is transforming research, including reliance on pattern recognition without understanding, bias in data, fraud, and irreproducible research. The survey shows that AI tools provide faster ways to process data, speed up computations, and save time and money. Among researchers who use AI, more than one-quarter believe AI tools will become 'essential' to their field in the next decade. Large language models like ChatGPT are mentioned as both impressive and concerning examples of AI tools in science. Source : https://www.nature.com/articles/d41586-023-02980-0 submitted by /u/NuseAI [link] [comments]
    Looking for AI text input like Artbreeder Mixer that combines images
    I'm looking for a (free) AI image generator like Artbreeder Mixer that has functions that let you "morph" or mix images together via text prompts. I've looked at a bunch already, and even tried adding the text of the different types in the prompts, but I've been getting separated results: prompts like "cat", "man", "head" won't combine the man and the cat, but rather give me un-morphed results, like a regular man plus a cat in a suit with no human features. I even got a result with a man standing behind a cat! I've tried StarryAI, imagecreator, and wepik, plus some others I can't remember, with no mixing; I can't afford Midjourney or other paid ones right now. In Artbreeder's interface, you can keep adding images and it will mix them together. I made these images and others like them very easily in Artbreeder, but its plan is very limited. I could buy more credits, but I need to wait a few days (new job, not paid yet, broke today... lol): morph between man and donkey; morph between angry rapper and gorilla. So, if anyone can suggest some free, or almost free (generous newbie credits?), tools that can do mixes like this, please point me in the right direction. submitted by /u/magusat999 [link] [comments]
    New York wants to be AI's world capital, in rivalry with San Francisco and Silicon Valley
    submitted by /u/norcalnatv [link] [comments]
    Could an AI-created profile picture help you get a job?
    Artificial intelligence (AI) is being used to create professional-looking profile pictures for job hunting websites like LinkedIn. Apps like Remini, Try It On AI, and AI Suit Up use AI-based software to generate slick profile photos that mimic the work of expert photographers. Users upload multiple selfies, and the AI software creates artificial photos with different hairstyles, clothing, and backdrops. While some find the results realistic, others think they look artificial. The AI services are popular because they are cheap or free, making them accessible to those who can't afford professional headshots. However, opinions are divided on whether AI-generated photos are beneficial or detrimental to self-esteem. Some believe that AI-generated photos allow individuals to put their best self forward and potentially increase their chances of being considered for opportunities. Others worry that relying on AI-generated photos may negatively impact self-worth and confidence. Recruiters generally do not consider whether a photo is AI-generated when evaluating job applications. Source : https://www.bbc.co.uk/news/business-67054382 submitted by /u/NuseAI [link] [comments]
    AI Tool for film footage notes
    Hi, I'm currently filming a documentary, but I’m so busy filming that I don’t have time to write notes on the footage for the editor. Does anyone know of an AI tool that can help with this, save time, and streamline the process? Kind regards submitted by /u/Brand0n_C [link] [comments]
    How will AI affect the traditional and open source software industry?
    Hey folks, how do you see the effect of AI? Will small software companies go bankrupt, since lots of software now uses tools like ChatGPT, Midjourney, etc.? This is just the start of a new era of AI technology, which will evolve over the years. In that time we will see more and more AI software, which will likely provide more efficient and better solutions compared to traditional and open source software. So my question is: how do you see this? Are the days of small software companies and open source software programs numbered? submitted by /u/Haziq12345 [link] [comments]
    One-Minute Daily AI News 10/11/2023
    Opera has launched Opera One — a new version of the browser that comes packaged with an AI-powered chatbot called Aria.[1] Adobe is going all in on AI, announcing three new generative AI models today that add powerful features to Illustrator and Adobe Express and vastly improve Photoshop’s text-to-image capabilities.[2] ‘South Park’ to Tackle AI for Next Event Special, Releases Teaser.[3] World’s first AI tutor launched in Australia to help students get through their exams.[4] Sources: [1] https://www.theverge.com/2023/6/21/23768888/opera-one-browser-aria-ai-assistant-chatbot [2] https://www.theverge.com/2023/10/10/23911114/adobe-max-firefly-generative-ai-model-photoshop-illustrator-express [3] https://www.hollywoodreporter.com/tv/tv-news/south-park-ai-joining-panderverse-1235615276/ [4] https://www.techguide.com.au/news/computers-news/worlds-first-ai-tutor-launched-in-australia-to-help-students-get-through-their-exams/ submitted by /u/Excellent-Target-847 [link] [comments]
    Cypher 2023: The Future of Simulation and Design is AI
    submitted by /u/Agitated-Spell3979 [link] [comments]
    Any ideas how this was created?
    submitted by /u/crispyTacoTrain [link] [comments]
    Web design tools
    I’m looking for input and advice on tools for web designers. I use WordPress a lot, Magento some, and frequently code by hand in HTML, JavaScript, and PHP. I know there are some AI tools out there now, but I don’t know which are best and wanted to find out what people's thoughts are on this subject. What tools are you using, for what, and why? Thanks! submitted by /u/PowerTarget [link] [comments]
  • Open

    [R] Researchers Identify Emergent Linear Structures in How LLMs Represent Truth
    LLMs' tendency to make up false statements (hallucinate) is a major concern. We need ways to inspect whether they really "know" something is true or not so we can reduce hallucinations. In a new paper, researchers found that LLMs contain an internal "truth vector" - an emergent linear structure that represents factual truth values. They had the insight to visualize how GPT represents simple true/false sentences. The true ones clustered together, while false ones clustered elsewhere - suggesting some kind of 'truth direction' in its learned representations. To test this, they trained linear "probes" on one dataset, and found they could generalize to accurately detect truth values in totally different datasets about other topics. They also directly modified the models to add or subtract the identified truth vectors from its processing of statements. This could flip assessments of truth value, showing the vector causally influences reasoning. Together, these findings provide evidence that neural networks can create emergent, linear structures that represent factual truth. This finding could eventually help make AI systems less prone to hallucinations and falsehoods. TLDR: LLMs can create emergent linear representations of truth. This sheds light on how AI represents abstract concepts and could help us reduce hallucinations. Full summary. Paper is here. submitted by /u/Successful-Western27 [link] [comments]  ( 9 min )
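    To make the "truth direction" idea in this summary concrete, here is a toy sketch of a linear probe. Everything in it is an assumption for illustration: the "activations" are synthetic vectors generated by `toy_activations`, the probe is a simple difference-of-means classifier, and `steer` mimics adding or subtracting a direction. None of this is taken from the paper's code or claimed to be its exact method.

```python
import random


def toy_activations(rng, n, sign, dim=8, noise=0.3):
    """Synthetic 'activations': coordinate 0 carries the label (+1 true, -1 false)."""
    return [
        [(sign if i == 0 else 0.0) + rng.gauss(0.0, noise) for i in range(dim)]
        for _ in range(n)
    ]


def mean(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def fit_direction(true_acts, false_acts):
    """Difference-of-means probe: direction = mean(true) - mean(false)."""
    mu_t, mu_f = mean(true_acts), mean(false_acts)
    direction = [a - b for a, b in zip(mu_t, mu_f)]
    midpoint = [(a + b) / 2 for a, b in zip(mu_t, mu_f)]
    bias = -sum(d * m for d, m in zip(direction, midpoint))
    return direction, bias


def predict_true(x, direction, bias):
    """Classify a vector by which side of the probe's hyperplane it falls on."""
    return sum(d * xi for d, xi in zip(direction, x)) + bias > 0


def steer(x, direction, alpha):
    """Shift a vector along the probe direction (alpha < 0 pushes toward 'false')."""
    return [xi + alpha * d for xi, d in zip(x, direction)]
```

    Fitting the probe on one batch of toy vectors and scoring it on a fresh batch mirrors, in miniature, the train-on-one-dataset and test-on-another setup the summary describes; adding the direction to a "false" vector flips the probe's prediction, which is the toy analogue of the causal intervention.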
    [D] Recommendations request for a guide to research publication
    I am working on a research topic in Data Engineering. Forgive me if this is a frequently asked question; I couldn't find this specifically in the FAQ. What are good publication tips and journals to publish in? I read through a few journals and all of them are big publications. What if I opt for some upcoming or other niche (maybe data engineering) journals? submitted by /u/Sherbhy [link] [comments]  ( 9 min )
    [R] SWE-bench: Can Language Models Resolve Real-world GitHub issues?
    We have a new benchmark out called SWE-bench (arxiv). It challenges LMs to solve real GitHub issues (feature requests & bug reports) from popular Python repos. Answers are validated using unit tests we crawled from those repos. The benchmark at swebench.com/ shows that even the strongest models, such as Claude 2 and GPT-4, get less than 5% accuracy. We are here to answer any questions you may have. submitted by /u/ofirpress [link] [comments]  ( 9 min )
    [D] Sample probability diffusion models
    I would like to understand how I can calculate the probability that a sample belongs to the distribution a diffusion model was trained on. Say I have an image of a car, and I would like to know whether this image belongs to the distribution that is estimated by the diffusion model. So I would like to know the probability, between zero and one, that the car belongs to this distribution. Do you know how I can technically do this? submitted by /u/That_Phone6702 [link] [comments]  ( 9 min )
    [Discussion] Making a Tutorial for Using a New Platform for ML in the climate and earth science space
    Hey guys, looking for some ideas. I'm building out a Jupyter Book that will be a tutorial on how to use a research platform for data analysis and modelling. My PI has given me free rein over it. I cannot think of a good topic to do the analysis and build the model on. It does not need to be complex, but it should be good enough that any researcher, student or organization using the platform can get a good idea of how to use it for ML. Any thoughts on a good area to look into? Any recommendations? Note that this will be a tutorial, so an overly complex model is unnecessary. I just cannot figure out what to look into, so I am hoping you guys could share thoughts about possible areas in climate, weather and earth science that I could focus on for the tutorial in the Jupyter Book. submitted by /u/AdditionalFun3 [link] [comments]  ( 9 min )
    [D] Submitting a paper rejected by EMNLP to ARR
    First time submitting to ARR here. I was quite confused about this paper resubmission thing. I got rejected by EMNLP (submission directly to EMNLP with openreview) a week ago and I am planning to resubmit it to the ARR system (also using openreview). Does this EMNLP submission count as a previous ARR submission that should be mentioned or not? Do I need to withdraw the paper from EMNLP openreview prior to submitting it to ARR openreview? submitted by /u/Icy-Distribution6887 [link] [comments]  ( 9 min )
    [D] [P] UI-based AI agents: UI-Act
    Hi! Happy to share a project I've been working on for a while: UI-Act https://github.com/TobiasNorlund/UI-Act It's an AI model architecture designed to autonomously navigate and interact with computers using the graphical user interface. Think of it as a co-pilot that "sees" your screen and acts on it, just as a human would. In essence, it's a custom transformer model taking a prompt and screenshots as input, with output heads to predict low-level actions, i.e. mouse clicks. In the demo, it has been trained to compute simple expressions in a calculator window, using expert demonstrations/behavior cloning. If scaled up appropriately, however, it could provide a basis for a general agent to automate arbitrary tasks on a computer. I would be interested in hearing your thoughts on it, especially with regard to the trend towards general AI agents and assistants (Windows Copilot / Adept ACT-1 / AutoGPT etc). LMs equipped with e.g. function calling are a trendy approach that relies on text-based state representations and APIs to take action. In cases where this is infeasible, UI-based agents might provide a more general alternative. As the agent's interface to the computer is shared with humans, it can be easily taught using expert demonstrations, and requires little or no technical expertise. Let me know what you think! submitted by /u/tobibbelfuel [link] [comments]  ( 9 min )
    [P] Learn how to make trustworthy and transparent machine learning models in Tsetlin Machine Book Chapter 7: Confidence, Trustworthiness, and Composites.
    ​ Confidence and trustworthiness of Tsetlin Machines. Hi all! Just completed a new chapter in the book An Introduction to Tsetlin Machines: https://tsetlinmachine.org Happy to receive feedback! Abstract: Collaboration can be essential to manage complex projects. One example is building a house. You then need the expertise of carpenters, plumbers, and electricians. Each profession brings unique skills to the table. Similarly, different types of Tsetlin machines can have distinct capabilities. In this chapter, you learn how Tsetlin machines can team up, allowing them to achieve more than they could on their own. The effectiveness of a team relies on recognising each member's strengths and limitations. Appreciating where your expertise stops and where your coworkers' expertise begins is crucial for effective collaboration. We first explore how Tsetlin machines can assess their competence in Section 7.1. Using the vote count from Chapter 1, you learn to measure how confident a Tsetlin machine is when it makes its decisions. It is possible to be highly confident and still perform poorly. To be trustworthy, confidence must be in line with one's capabilities. Therefore, Section 7.1 also covers how to evaluate trustworthiness. Next, in Section 7.2, you discover how to build a team of Tsetlin machines with different skills. By assessing each Tsetlin machine's confidence, you can lean on the confident ones when making decisions. The result is a Tsetlin machine composite - a construction where multiple Tsetlin machines join forces. You can think of it as a composite material, such as epoxy, which reinforces resin with fibres, making it strong, lightweight, and durable. submitted by /u/olegranmo [link] [comments]  ( 9 min )
    [R] [D] Need Peer Review: Unsupervised Learning for Student Dropout Anomaly Detection
    Hello all, Just wrapped up Task 1.1 for anomaly detection in student dropout rates. Keen for some extra eyes on it. Task Highlights: Data Pre-processing & Normalisation K-Means Clustering Gaussian Anomaly Detector Used PCA for dimensionality reduction Links to the following files: data.csv Task 1.1 - Rubric.pdf Task1.1Script.ipynb https://drive.google.com/drive/folders/17XcjEoYCrDWqf90VVNdkLAkYNdtWWwGu?usp=sharing Would greatly appreciate any feedback! Cheers! submitted by /u/Nook31 [link] [comments]
    [R] A method to assess trustworthiness of machine coding at scale
    submitted by /u/mnky9800n [link] [comments]  ( 8 min )
    [P] [vilays] Prototype Video Demo - Any Feedback from ML Engineers?
    Hi everyone, I’m thrilled to share a prototype we've been tirelessly working on. We are developing a virtualization environment for applications, specifically tailored to engineers, designers, data scientists, and researchers. In a nutshell, our platform enables users to run cloud-hosted desktop apps from any device, making it appear as if the applications are installed on their local machines, while they're actually operating on a remote server. The ultimate goal is to obliterate barriers between local and cloud execution, especially for compute-intensive workloads, thereby allowing seamless usage of High-Performance Computing software on the cloud with the scalability to adjust computing resources as per necessity. We’re here to solicit your invaluable feedback on our product video demo. Your insights will not only help us identify any blind spots and enhance our solution but also better understand the needs and preferences of our potential user base. 📽 [https://youtu.be/QR8FWRnPrXM?feature=shared] We're eagerly awaiting your thoughts and appreciate you taking the time to help us refine our product! Thank you! :) submitted by /u/aaron-cesaro [link] [comments]
    [D] Databricks Dolly 15k - Creating Synthetic Variants
    Hey all, I found Dolly to be a very interesting project when it was released, but I'm curious if it has similar value today because a lot of synthetic data generation options seem to be popping up. Now it seems like Dolly is human generated/curated by over 5k employees (which is great), but wouldn't it be a better approach now to have Llama70b (or maybe Falcon) just generate future variants of 15k rows? I haven't been able to figure out why we aren't seeing more synthetic datasets like this on HF. Is the bottleneck licensing, compute or just incentive? Here's the original Dolly post thread: https://www.reddit.com/r/MachineLearning/comments/120usfk/r_hello_dolly_democratizing_the_magic_of_chatgpt/ submitted by /u/buzzyness [link] [comments]
    [D] Please suggest a Loss function for image to image task.
    What loss function should be used for a task that takes an input image with a lot of haze and produces an image with reduced haze? The architecture is a simple encoder-decoder architecture. I tried MSE, as some articles and ML guides say that MSE is good for pixel-wise comparison, and also tried categorical cross-entropy, but neither of them works well. MSE works but produces artefacts like red/green/blue spots and spatters, and at worst it produces a white image. The research on this task includes the use of SIDNet [Single Image Dehazing Net], transmission maps, the dark channel prior algorithm, FFA-Net, etc., trained on the benchmark datasets (RESIDE, SOTS). I aim to create a simple architecture for a college project, so I chose the encoder-decoder architecture. Any suggestions are appreciated. submitted by /u/Wild_Basil_2396 [link] [comments]
    [D] Startup team demonstrates differentiable Swift compiler outrunning TensorFlow by 322X
    Autonomous systems startup PassiveLogic assembled a differentiable computing team to build a fast systems language with native performance differentiability. Their latest benchmark trains networks two orders of magnitude faster than PyTorch and TensorFlow. See: LinkedIn Post. It's a collaborative effort with the Swift community and Apple's compiler team, using the Swift language as a strongly typed embedded language that performs ahead-of-time compilation of graph neural nets. The focus is on fusing systems programming and AI engineering into a single native high-performance language, to enable typed heterogeneous inference and training. The compiler development is open sourced as part of the standard Swift package. Try it yourself at swift.org. submitted by /u/taharvey [link] [comments]  ( 9 min )
    [D] How is test-driven development implemented in the context of machine learning?
    I recently tried to refactor a previous project of mine, but I realized that after making all of the changes the performance wasn't reproducible anymore. I decided to start from scratch, make incremental changes, and make sure that the model's performance is maintained with each change. Very basic in hindsight, but I guess I was too hasty with coding. Anyway, running the full model's training and evaluation with each change is proving to take too long. I'm curious if there's any other way that people implement TDD in the context of machine learning, since projects tend to be more time-consuming than typical applications. submitted by /u/Seankala [link] [comments]
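    One common family of answers to this question is to replace full training runs with fast checks that run in seconds: the loss should drop after one step, the model should be able to overfit a tiny batch, and a fixed seed should give identical results. The sketch below illustrates these three checks on a hypothetical two-weight linear model written in plain Python; the model, data (`XS`, `YS`), and function names are all made up for illustration and stand in for whatever the real project uses.

```python
import random

# Tiny fixed dataset generated by the weights [2, -1].
XS = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
YS = [2.0, -1.0, 1.0]


def init_params(seed, dim=2):
    rng = random.Random(seed)
    return [rng.uniform(-0.1, 0.1) for _ in range(dim)]


def predict(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def mse(w, xs, ys):
    return sum((predict(w, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def sgd_step(w, xs, ys, lr=0.1):
    """One full-batch gradient step on the mean squared error."""
    grad = [0.0] * len(w)
    for x, y in zip(xs, ys):
        err = predict(w, x) - y
        for i, xi in enumerate(x):
            grad[i] += 2 * err * xi / len(xs)
    return [wi - lr * g for wi, g in zip(w, grad)]


def train(w, xs, ys, steps, lr=0.1):
    for _ in range(steps):
        w = sgd_step(w, xs, ys, lr)
    return w


def test_loss_decreases_after_one_step():
    w = init_params(seed=0)
    assert mse(sgd_step(w, XS, YS), XS, YS) < mse(w, XS, YS)


def test_overfits_tiny_batch():
    w = train(init_params(seed=0), XS, YS, steps=500)
    assert mse(w, XS, YS) < 1e-6


def test_same_seed_same_result():
    a = train(init_params(seed=1), XS, YS, steps=10)
    b = train(init_params(seed=1), XS, YS, steps=10)
    assert a == b
```

    Checks like these do not prove that final accuracy is preserved after a refactor, but they tend to catch the most common breakages (wrong sign, detached gradient, shuffled labels, hidden nondeterminism) without a full training run.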
  • Open

    Developing industrial use cases for physical simulation on future error-corrected quantum computers
    Posted by Nicholas Rubin, Senior Research Scientist, and Ryan Babbush, Head of Quantum Algorithms, Quantum AI Team If you’ve paid attention to the quantum computing space, you’ve heard the claim that in the future, quantum computers will solve certain problems exponentially more efficiently than classical computers can. They have the potential to transform many industries, from pharmaceuticals to energy. For the most part, these claims have rested on arguments about the asymptotic scaling of algorithms as the problem size approaches infinity, but this tells us very little about the practical performance of quantum computers for finite-sized problems. We want to be more concrete: Exactly which problems are quantum computers more suited to tackle than their classical counterparts, an…  ( 94 min )
  • Open

    UK Tech Festival Showcases Startups Using AI for Creative Industries
    At one of the U.K.’s largest technology festivals, top enterprises and startups are this week highlighting their latest innovations, hosting workshops and celebrating the growing tech ecosystem based in the country’s southwest. The Bristol Technology Festival today showcased the work of nine startups that recently participated in a challenge hosted by Digital Catapult — the…  ( 6 min )
    Get in Gear: ‘Forza Motorsport’ Races Onto GeForce NOW
    Put the pedal to the metal this GFN Thursday as Forza Motorsport leads 23 new games in the cloud. Plus, Acer’s Predator Connect 6E is the newest addition to the GeForce NOW Recommended program, with easy cloud gaming quality-of-service (QoS) settings built in to give Ultimate members the best streaming experience. No Breaks, No Limits, Read article >  ( 6 min )
  • Open

    DeepMind 2022 'full accounts' financial report: 2022 budget: £1,081 million ($1.3b) (decreased by a fifth from 2021)
    submitted by /u/gwern [link] [comments]
    RL for non-Python environments?
    Most real-world applications for RL (robotics, game dev, finance) are not normally done in Python, yet all major RL frameworks are written in Python. Is there a good/high-performance cross-language framework to do RL in other languages like C++/.Net/Java? If not, do you think people would be interested in such a framework? submitted by /u/xor24 [link] [comments]
    Reinforcement learning agents that adhere to a causal model of the problem
    Do you know any work that tries to develop RL agents that exploit some sort of high-level model of the problem (it could also be given by an expert human) to learn faster or operate on out-of-distribution scenarios? I'm particularly interested in Causal Models, but any similar thing could be interesting for me submitted by /u/fedetask [link] [comments]
    What is the intuitive explanation for using log probabilities in Policy gradient methods instead of simple probabilities? does it improve gradient descent optimization ?
    submitted by /u/aabra__ka__daabra [link] [comments]
    Why does Drq-v2 sample from replay by episode then experience?
    I've been looking at DrQ-v2 (https://github.com/facebookresearch/drqv2) recently and it samples from replay in a way that seems odd to me but may have a purpose I don't understand. They store experiences in a compressed file by episode, this makes some sense since it means they don't have to store everything in RAM and they delay disk writes until the end of the episode so they don't slow down the sim operation. On sampling, they randomly select an episode then randomly select an experience from the episode, calculating the n-step reward dynamically at sample time instead of at experience storage time. This is then fed to the model by a pytorch DataLoader. This means a _lot_ of disk reads during the optimization step which can't be ideal but I'll put that aside. What is the advantage of doing this selection by episode? It may give a better spread across episodes in each update, but I'm not sure that makes up for the potential downsides of making prioritization and other replay tricks much harder. Any ideas? submitted by /u/EDMismyO2 [link] [comments]
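    The sampling scheme described in this post (pick an episode, pick a step inside it, compute the n-step reward at sample time) can be sketched roughly as below. This is a minimal illustration and not the DrQ-v2 repository's code: the function name `sample_transition`, the in-memory list of `(obs, action, reward)` tuples, and the default `nstep` and `discount` values are all assumptions made for the example.

```python
import random


def sample_transition(episodes, nstep=3, discount=0.99, rng=random):
    """Sample one n-step transition: episode first, then a step within it.

    `episodes` is a hypothetical in-memory stand-in for the per-episode files:
    a list of episodes, each a list of (obs, action, reward) tuples, where the
    reward at index i is the reward received after acting at index i.
    Every episode must contain more than `nstep` steps.
    """
    episode = rng.choice(episodes)                  # 1) pick an episode uniformly
    start = rng.randrange(0, len(episode) - nstep)  # 2) pick a start step inside it
    obs, action, _ = episode[start]

    # 3) accumulate the discounted n-step reward at sample time
    reward, scale = 0.0, 1.0
    for k in range(nstep):
        reward += scale * episode[start + k][2]
        scale *= discount

    next_obs = episode[start + nstep][0]
    # `scale` is now discount**nstep, the factor to apply to the bootstrapped value
    return obs, action, reward, scale, next_obs
```

    One consequence visible in the sketch: because episodes are drawn uniformly, steps in short episodes are sampled more often than steps in long ones, and per-transition priorities would need a second index on top of this structure.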
    Can reinforcement learning models learn to rank?
    I have a very simple observation: a list of random value state = [random.uniform(-0.2, 0.2) for _ in range(200)] reward = state * actions . The reward is not using the next state, it's using the previous state i gave to the model. So basically i already give the answer to the model, the best action is : if state > 0 action =1, if state < 0 action = -1 I tried using PPO, but it seem not learning at all. My test_env.py is here: ``` import gymnasium as gym import numpy as np from gymnasium import spaces from gymnasium.utils import seeding from stable_baselines3.common.vec_env import DummyVecEnv import random class TestEnv(gym.Env): metadata = {"render.modes": ["human"]} def __init__( self, item_count, test_steps, is_train = True, ): self.is_train = is_train self.test_steps = test_step…
  • Open

    Microsoft at VL/HCC 2023: Focus on co-audit tools for spreadsheets
    These research papers were presented at the IEEE Symposium on Visual Languages and Human-Centric Computing (opens in new tab) (VL/HCC 2023), a premier forum for design, theory, and application of computing technologies for programming, modelling, and communication. Large language models (LLMs) have revolutionized the way novice programmers and everyday computer users tap into the capabilities […] The post Microsoft at VL/HCC 2023: Focus on co-audit tools for spreadsheets appeared first on Microsoft Research.  ( 10 min )
  • Open

    Homework problems are rigged
    This post is a follow-on to a discussion that started on Twitter yesterday. This tweet must have resonated with a lot of people because it’s had over 230,000 views so far. You almost have to study advanced math to solve basic math problems. Sometimes a high school student can solve a real world problem that […] Homework problems are rigged first appeared on John D. Cook.  ( 7 min )
  • Open

    12 Generative AI Trends to Watch Out for
    The advent of generative AI is empowering everyone alike – organizations, small businesses, individuals, students, and medical professionals, to name a few. The last couple of years have been revolutionary for artificial intelligence innovation and transformation. How will 2024 shape up for AI, AI tools, and related professionals? Let’s analyze the trends that are most… Read More »12 Generative AI Trends to Watch Out for The post 12 Generative AI Trends to Watch Out for appeared first on Data Science Central.  ( 20 min )

  • Open

    Predictive AI analyzing attraction to facial features (iris Dating app)
    Top dating apps Tinder, Hinge and Bumble have all stated that they're already investing in AI to make their apps better. They're using it to verify profiles, match people based on bios and interests, and help generate profile descriptions and liven conversations. But what about machine learning on user photos? iris Dating uses AI to analyze user input in the form of liking or disliking faces ("swiping" profiles). We all know if we like blondes or brunettes, blue or brown eyes, short or long hair, beard or no beard, etc. But AI can pick up the subtlest features (proportions, distances, curvatures etc.) and build a face map. A matrix of features, if you will. It doesn't just look for a person looking like your favorite celebrity crush. It understands what you're really attracted to. From there it's an easy path: if it knows which features attract me, it can predict my level of attraction to a specific individual (specifically, their face). Find the persons with the highest predicted attractiveness (for me, not for everyone), rank them by attraction for me, and we have a potential high mutual attraction match. The two stats I have are that on average women like 55%(!) of the profiles iris picks for them; and that users have 40x higher chances of matching when they've trained the model to understand their taste. I know it takes a lot more than a pretty face to make for a great relationship, but it sure doesn't hurt to start with strong physical attraction. Missed connections on Craigslist are about just that: seeing a face you can't forget. Find me more of these "wow" faces and let's go from there. What do you think? Is it too early? Too bold? Too niche? submitted by /u/akahamlet [link] [comments]
    Superman if portrayed by different actors (as imagined by AI)
    submitted by /u/fat_n_stupid [link] [comments]
    DALL·E 3 is blocking copyrighted material. Also DALL·E 3:
    submitted by /u/Zimmax [link] [comments]
    The AI research job market shit show
    The AI research job market is going through a shakeup, with a high demand for skilled researchers and a scarcity of talent. Companies closely monitor the movements of researchers as an indicator of their ability to transition from concept to product. The market is highly competitive, with researchers being offered high salaries and compensation packages. This has led to high turnover and attrition in many companies, causing unsettledness among employees. Despite the challenges, the investment in AI research is expected to drive innovation and push the boundaries of the Transformer architecture. Source : https://www.interconnects.ai/p/ai-research-job-market submitted by /u/NuseAI [link] [comments]
    Are there any low res (pixel art) art tools?
    I'm looking for ways to create art for a game I'm creating. submitted by /u/Yenii_3025 [link] [comments]
    Inverting Transformers Significantly Improves Time Series Forecasting
    Transformers are great at NLP and computer vision tasks, but I was surprised to learn they still lag behind simple linear models at time series forecasting. The issue is how most Transformer architectures treat each timestamp as a token and fuse all the variable data from that moment. This makes two big problems: Variables recorded at slightly different times get blurred together, losing important timing info Each token can only see a single moment, no long-term dependencies So Transformers struggle to extract useful patterns and correlations from the data. Some researchers from Tsinghua University took a fresh look at this and realized the Transformer components themselves are solid, they just need to flip the architecture for time series data. Their "Inverted Transformer" (or iTransformer): Makes each variable's full history into a token, instead of each timestamp Uses self-attention over variables to capture relationships Processes time dependencies per variable with feedforward layers This simple tweak gives all the benefits we want: State-of-the-art forecasting accuracy, beating both linear models and standard Transformers Better generalization to unseen variables Increased interpretability Ability to leverage longer historical context TLDR: Inverting Transformers to align with time series structure allows them to outperform alternatives in working with time series data. Full summary. Paper is here. submitted by /u/Successful-Western27 [link] [comments]
    Best ChatGPT Plugins: Ultimate List for 2023
    submitted by /u/Senior_tasteey [link] [comments]
    The NSFW dream (truely unrestricted ai desires)
    I guess I'm looking for the impossible but does anyone know of a generator that has all of the following qualities in order of importance least to most important: Has a massive variety of styles like Womba's private discord server does. "Create variants" function like how a Womba discord personal server generator allows you to do. Generates beautiful "digital art" style images like the digital https://www.unstability.ai/ does. (Man those images are pretty) faces are really good most of the time. (It's frusterating as it looks so good but I can't seem to get any group sex poses going on.) Provides a variety of poses such as https://easywithai.com/ai-image-generators/promptchan-ai/ which also allows you to upload you own images for poses, like how I could upload a real life orgy image and as long as it could distinguish the bodies as being separate (not a big pile of limbs) it does pretty good, but lacks severely lacks in facial quality. Like a big booty girl in hyperreal style 1080P or higher resolution. (Again Womba is good here, but they are just extreme on their restrictions.) 1080P should be the minimum for any paid service as how can we truely enjoy a full screne image on anything less without it pixeling out? Doesn't cost $150/month (yes I found one that does all this but their premium subscription cost like $150/month (seduced.ai) and it's not even unlimited. I paid $90 for a full year at Womba discord unlimited but again, $150/month is just not worth it. If anyone knows of a server that has all these for around $25/month or less, please let me know. If really appreciate it. submitted by /u/russader [link] [comments]
    Can AI reference both photos to make the black and white photo the same as the colour image?
    I have a high resolution black and white print and a generic quality colour image of the same photo, that I'd like AI to look at both images and make the B&W into colour. Is this possible? submitted by /u/NikonD3X1985 [link] [comments]
  • Open

    [D] how to download datasets from huggingface
    Hello, first time using Google Colab and huggingface datasets. Colab notebook is easy to setup but I can't seem to figure out how to download datasets from huggingface. I am trying to download https://huggingface.co/datasets/kili-technology/plastic_in_river dataset in Colab Notebook. After reading some beginners forums, I modified the example to look like one below but it failed. from datasets import load_dataset data_files = {"train": "train.csv", "test": "test.csv", "validation": "validation.csv"} dataset = load_dataset("kili-technology/plastic_in_river", data_files=data_files) Because there's no path to the files to be downloaded. Can someone explain how to download datasets from huggingface please? Downloading builder script: 100% 3.25k/3.25k [00:00 in () 2 3 data_files = {"train": "train.csv", "test": "test.csv", "validation": "validation.csv"} ----> 4 dataset = load_dataset("kili-technology/plastic_in_river", data_files=data_files) 5 frames /usr/local/lib/python3.10/dist-packages/datasets/data_files.py in resolve_pattern(pattern, base_path, allowed_extensions, download_config) 366 if allowed_extensions is not None: 367 error_msg += f" with any supported extension {list(allowed_extensions)}" --> 368 raise FileNotFoundError(error_msg) 369 return out 370 FileNotFoundError: Unable to find 'https://huggingface.co/datasets/kili-technology/plastic_in_river/resolve/main/train.csv' submitted by /u/0ni0nrings [link] [comments]  ( 9 min )
    [D] How do byte-level language models work?
    I've recently been trying to pre-train my own small language model on the tiny-series datasets on huggingface. I also wanted to use a model similar to MEGABYTE but I don't understand how using bytes would work. The only implementation I could find from lucidrains used str(chr(max(32, token))) to decode any token (byte) to a character and put the embedding size as 256. Firstly, why 256 and not 256-32 as any values below 32 are ignored? Also, many byte-level models including this and ByteT5 mention that they can process any text sequence even in a multilingual setting, however how would that be true if we are only using one byte, would we have to move to 2 bytes or use an UNK token, and if we did use 2 bytes that would make our embedding size around 65000 which defeats sort of the point as o…  ( 10 min )
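    On the multilingual question, a short sketch may help, assuming the model consumes UTF-8 bytes (an assumption about the setup, not a description of any particular implementation). Every string, in any language, encodes to a sequence of values in 0..255, so a 256-entry vocabulary covers all text; non-ASCII characters simply take more than one token. The helper names `encode` and `decode` are made up for this example.

```python
def encode(text):
    """Turn a string into byte-level token ids (each id is in 0..255)."""
    return list(text.encode("utf-8"))


def decode(ids):
    """Turn byte-level token ids back into a string.

    Invalid byte sequences (e.g. from a model that stopped mid-character)
    are replaced rather than raising an error.
    """
    return bytes(ids).decode("utf-8", errors="replace")


for sample in ["hello", "héllo", "你好"]:
    ids = encode(sample)
    print(repr(sample), ids, "tokens:", len(ids))
```

    Under this scheme no UNK token or two-byte vocabulary is needed; the cost is longer sequences for scripts whose characters span several bytes.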
    [P] Evaluating and tuning a model when the population may change YoY and best practices for mitigating overfitting on features that correlate with time.
    Consider a predictive model that predicts whether an outcome Y will occur in Q1 2023, based on data from Q1 2022. Now, if we want to predict outcomes for 2024, we must use last year's data to build the model, but we are going to have some bias if there are features that vary year over year. Is the best approach in such a situation to try to tune/validate the model with other years, in the hope of mitigating any features that are correlated with a specific year? Any help would be much appreciated, as I can't find agreed-upon methods. submitted by /u/unga123 [link] [comments]  ( 9 min )
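    One way to make the "validate with other years" idea concrete is a rolling-origin (forward-chaining) split, where each validation year is predicted only from earlier years. The sketch below is one option under that assumption and not an agreed-upon standard; the function name `rolling_origin_splits` is hypothetical.

```python
def rolling_origin_splits(years):
    """Yield (train_years, test_year) pairs where training always precedes testing."""
    ordered = sorted(set(years))
    for i in range(1, len(ordered)):
        # train on every year seen so far, validate on the next one
        yield ordered[:i], ordered[i]


for train_years, test_year in rolling_origin_splits([2019, 2020, 2021, 2022]):
    print("train on", train_years, "-> validate on", test_year)
```

    A feature that only helps within a single year tends to show unstable scores across these folds, which is one signal that it is tracking time rather than the outcome.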
    Is there a model to input anecdotal text stories as training data to return a more comprehensive story? [P]
    I have a goal and am looking for direction from others who know more than me about machine learning. I want to submit 5-10 pieces of text to a model. The text will be anecdotes from a common experience, but each one from a different person’s perspective. For example, if a family visits a theme park, each family member will have a story or two about the day. Each family member’s story would be a submission to the model. One person might have loved the roller coaster and can tell about the exciting parts. Another person maybe just can’t stop talking about how great the food was. Someone else maybe felt sick and complains the line at the bathroom was too long. Perhaps another family member also rode the same roller coasters as the first person but instead hated it, so would have a very different description of it than the first. All these anecdotes are submitted to the model. Then, the model can be queried. Such as, “Tell me about the theme park.” or “I love roller coasters. Tell me about the theme park.” or “I tend to overeat, tell me about the theme park.” (the model wouldn’t hype up the food, maybe it would talk about how much exercise the visitors get by walking around all day.) In this case of a theme park context, the model would have a preconception of a theme park. It would know the general concept, know of several examples or standards that it could compare this theme park against, understand it’s all for fun, etc. This type of model may be available as an API or model already and I just don’t know about it. That’d be fine, please point me towards it. Or, maybe there’s something already available that would need to be tweaked or customized. submitted by /u/Semper_Disco [link] [comments]  ( 10 min )
    [D] Help me learn ML easily, especially model building and EDA
    Can you give easy-to-understand sources and a hands-on practice methodology to master ML? Help me understand how to build models inside and out. Thank you submitted by /u/the_mystic_1 [link] [comments]  ( 9 min )
    NSF workshop on LLMs in chemistry education [R]
    Over Feb 12-13 of 2024, the National Science Foundation (NSF) is sponsoring a workshop titled “Integrating LLMs into the Materials Chemistry Curriculum” in Golden, Colorado. We aim to explore and develop innovative ways to incorporate large language models (LLMs, e.g. GPT, ChatGPT, and Bard) into upper division chemistry laboratories and virtual lab experiences. During the workshop, participants will brainstorm and create demonstrations incorporating LLMs into the curriculum. The event will bring together folks across academia and the private sector with disciplinary backgrounds that range across chemistry, computer science, materials science, physics, and education. There is no registration fee, and we anticipate being able to cover the majority of participant travel costs thanks to NSF support. Participants early in their career (i.e., graduate students, postdoctoral scholars) are particularly encouraged to apply. If you are interested in participating in this workshop, please fill out the Google form (link below). Please feel free to distribute this invitation widely. Application: https://forms.gle/P9QdNiCuaUAHFZj29 submitted by /u/KC2792 [link] [comments]  ( 9 min )
    [P] Where to find projects to contribute to?
    Hello, I'm a developer with 6 years of experience in the mobile field, and I recently completed my master's degree in artificial intelligence (Text mining). I want to transition into the field of AI, but I need more experience with projects in the "real world," outside of academia, and I'd like to contribute to an open-source project. I looked on Github, but I ended up feeling confused and not sure where to start. P.S.: I did some research in this subreddit, but the posts about contributions seemed a bit dated. submitted by /u/Substantial_Fact_205 [link] [comments]  ( 9 min )
    [P] Image based Python + OpenCV automation, MMORPG Laghaim Auto-Fighter Bot Demo
    Video: https://youtu.be/0m12vkaoE7w Detailed Medium post will follow in the upcoming days. https://medium.com/@pssdplayer submitted by /u/HistorianCrafty3514 [link] [comments]  ( 9 min )
    [D] - I have 20-30 million shopify products dataset, any ideas?
    I have collected over 20 million shopify products & had the following ideas for them: - LLM ( Finetune an llm to know how to speak ecom ) - Video bot that can make videos on those products, using their description, elevenlabs & AIFaceGen - EcomStore that will markup the products about 30% ( This will need the bot to frequently scrape, to ensure that the products are up to date ) - Selling the dataset based on fragments, like 1$ per 1k-10k records, depends on what sells. Please let me know if these are good ideas, and if someone would like to support / help me in any way ( I just need to selfhost my supabase instance, & add all the products to it & then dev can get started ) submitted by /u/AdonisCodes [link] [comments]  ( 9 min )
    [D] Best open-source AI model for QA generation from context
    As the title says I’m looking for an open-source AI model for generating question-and-answers with a correct answer option and explanation to the correct answer from the input context. So far I have tried these models, TheBloke/Llama-2-7B-GPTQ TheBloke/Llama-2-13B-GPTQ TheBloke/Llama-2-7b-Chat-GPTQ (the output is not consistent. Sometimes I get an empty response or without the correct answer option and an explanation data) TheBloke/Llama-2-13b-Chat-GPTQ (even 7b is better) TheBloke/Mistral-7B-Instruct-v0.1-GGUF(so far this is the only one that gives the output consistently. But not able to generate more than 2 QA due to max token limit of 512. Even tried setting the max token as 1024, 2048 but nothing helped) TheBloke/Mistral-7B-OpenOrca-GGUF NousResearch/Llama-2-7b-chat-hf My system configurations are: Windows 10 with 16GB GPU Additional Information: The input prompt token will be around 250-350 tokens per request. submitted by /u/gokulcv [link] [comments]  ( 9 min )
    Churn Prediction [R]
    I want to build a model to predict churn in a third party logistics company. What variables should make up my data? Any help would do. Thanks submitted by /u/DisastrousAd8814 [link] [comments]  ( 9 min )
    [D] Recommendations for CPU-Based Real-Time Vector Database Indexing and Matching?
    Hello everyone, I have a specific online vectorization use case: I'm looking to search the internet for articles, vectorize these articles along with the search queries, and then retrieve the most relevant passages from them. Currently, I have basic hosting through DigitalOcean. Could anyone recommend the most suitable vector database for this task? Additionally, considering my resources, is it feasible to run this system solely on CPUs? And if so, would this setup be scalable if deployed on CPUs only? submitted by /u/Traditional-Poet2746 [link] [comments]  ( 9 min )
    [R] network digital twin for cybersecurity
    Hi all, for a text work of mine I am trying to do a project based on generating digital twin of networks. My goal is to create a digital twin of a network and then work on it from a cyber security point of view. I will briefly explain what I would like to do. I am currently using software for network vulnerability scans (OpenVAS). I use this software to perform network vulnerability scans at the network level, so basically to OpenVAS I pass a network (for example 192.168.xx.xx/24) to automatically identify all the vulnerabilities that are there. The next step ( what I'd like to do and that's why I'm asking for your advice) is to create a digital twin of the newly scanned network and then perform a penetration test on this digital twin of the network, without going to stress the actual network. Ideally, I would like to pass the output of the OpenVAS vulnerability scans, routing rules, and firewall rules to some tool that will then generate for me the digital twin of the network, which will then be used for offensive cybersecurity, so exploits, privilege escalation, etc.... will be tested on this digital twin without worrying about breaking some kind of service or stressing the real network. What I am asking is, do you know of any tool that would do the trick for me? So some tool that allows me to generate a digital twin of a network by providing as input vulnerability scans (xml,json,csv etc...), routing rules, firewall rules, pcap traces etc... Do you have any references or documentation? Are you aware of any open source tools? I thank you for your helpfulness! ​ submitted by /u/Salt-Arugula-8128 [link] [comments]  ( 9 min )
    Best approach for VFX lineups using ML [Project]
    Quick intro: Lineups are one of the first steps in the VFX pipeline. Source: - original footage that was shot on set - a reference (QuickTime) video from the film edit. Task: The reference shows modifications to the original footage. They can be: - timewarp (either fixed retimes like 200% speed or completely random) - transform (moved the image in x/y axis, rotation, scale, etc.) So the lineup task is to align the original footage to the reference QuickTime. What I did so far: Made a simple script in the software Nuke, using some Python and readily available tools to make it work on a simple shot. The general logic is to compare every frame; the associated one is the frame with the least difference between the two. This works on super simple and straightforward tasks (can provide more info if needed). Issue: Some references are more heavily modified. They can have some muzzle flash, basic 3D objects, or even some slight error introduced, like a distortion applied to the image when there shouldn't be one, so it will never be perfectly aligned. This makes the difference of the full frame higher for some frames, making the lineup wrong (it will take the wrong frame that has no muzzle flash, because it has less difference...). Another thing to consider is that watermarks cover the ref and the colors do not perfectly match; I can get them close enough, but there's a difference. Conclusion: Because of those issues, I'm thinking about using machine learning. I have next to no knowledge on the subject. I know there is a bunch of ways to train a model, but no clue where to start, so here's my question: Which learning style has the best potential to solve this task? submitted by /u/Pretty_Customer_8113 [link] [comments]  ( 9 min )
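    The "least difference" matching described in the post, plus one non-ML mitigation for watermarks, can be sketched as follows. This is a toy illustration and not the poster's Nuke script: frames are flat lists of pixel values, and the names `mean_abs_diff`, `match_frames`, and the optional `mask` (True for pixels to compare, False for pixels to ignore) are assumptions made for the example.

```python
def mean_abs_diff(a, b, mask=None):
    """Mean absolute pixel difference, optionally ignoring masked-out pixels."""
    total, count = 0.0, 0
    for i, (pa, pb) in enumerate(zip(a, b)):
        if mask is not None and not mask[i]:
            continue  # skip pixels covered by e.g. a watermark
        total += abs(pa - pb)
        count += 1
    return total / count if count else float("inf")


def match_frames(reference, source, mask=None):
    """For each reference frame, return the index of the closest source frame."""
    matches = []
    for ref in reference:
        best = min(range(len(source)),
                   key=lambda j: mean_abs_diff(ref, source[j], mask))
        matches.append(best)
    return matches
```

    Restricting the comparison to unmodified regions is a cheap baseline to try before training anything; it does not address retimes that blend frames or color mismatches.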
    [R] What are some interesting research topics to study in the intersection of ML and signal processing currently?
    I will have to pick and start a research project next January for my final year. So wanted to start exploring now. I want to do something substantive and interesting enough to get published. submitted by /u/BadMeditator [link] [comments]  ( 9 min )
    [R] Mistral 7B
    submitted by /u/hardmaru [link] [comments]  ( 8 min )
    [R] Tsinghua University: Inverting Transformers Significantly Improves Time Series Forecasting
    Transformers are great at NLP and computer vision tasks, but I was surprised to learn they still lag behind simple linear models at time series forecasting. The issue is how most Transformer architectures treat each timestamp as a token and fuse all the variable data from that moment. This makes two big problems: Variables recorded at slightly different times get blurred together, losing important timing info Each token can only see a single moment, no long-term dependencies So Transformers struggle to extract useful patterns and correlations from the data. Some researchers from Tsinghua University took a fresh look at this and realized the Transformer components themselves are solid, they just need to flip the architecture for time series data. Their "Inverted Transformer" (or iTransformer): Makes each variable's full history into a token, instead of each timestamp Uses self-attention over variables to capture relationships Processes time dependencies per variable with feedforward layers This simple tweak gives all the benefits we want: State-of-the-art forecasting accuracy, beating both linear models and standard Transformers Better generalization to unseen variables Increased interpretability Ability to leverage longer historical context TLDR: Inverting Transformers to align with time series structure allows them to outperform alternatives in working with time series data. Full summary. Paper is here. submitted by /u/Successful-Western27 [link] [comments]  ( 9 min )
  • Open

    Neural Networks From Scratch in Rust
    submitted by /u/zezeartix [link] [comments]  ( 8 min )
    Activation function for generating Shapley values
    Hi, I want to train a neural network to calculate Shapley values based on a given characteristic function. Depending on a given characteristic function, calculated through a dedicated algorithm, Shapley values can be any number, positive or negative, without a set range. Because of this, I am unsure, for the specific application of calculating Shapley values, what activation function to use in a neural network that would calculate them. The relu function, as well as leaky relu function, either cannot give values that are negative or have trouble giving large negative values, and sigmoid or tanh can only give values in a certain range. I am aware that there are other commonly used activation functions, but all the ones I could find had one of these issues, which would make training a network to calculate Shapley values difficult. Any advice? submitted by /u/PowNotBigSurprise [link] [comments]  ( 9 min )
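    For context on why the targets are unbounded, here is a small sketch that computes exact Shapley values from a characteristic function by averaging marginal contributions over all player orderings. The function name `shapley_values` and the toy game are made up for illustration. Since the resulting values can be any real number, regression networks for targets like these commonly leave the output layer linear (no activation) and use non-linear activations only in the hidden layers; that is a general convention, not something stated in the post.

```python
from itertools import permutations


def shapley_values(players, v):
    """Exact Shapley values for characteristic function v(coalition) -> float.

    Enumerates every ordering, so it is only practical for a handful of players.
    """
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += v(with_p) - v(coalition)  # marginal contribution of p
            coalition = with_p
    return {p: total / len(orderings) for p, total in totals.items()}


def toy_game(coalition):
    """Hypothetical game: each member costs 3, but 'a' and 'b' together earn 10."""
    bonus = 10.0 if {"a", "b"} <= coalition else 0.0
    return bonus - 3.0 * len(coalition)


print(shapley_values(["a", "b", "c"], toy_game))
```

    In this toy game one player ends up with a negative value, which is exactly the case a ReLU or sigmoid output layer cannot represent.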
  • Open

    Improve performance of Falcon models with Amazon SageMaker
    What is the optimal framework and configuration for hosting large language models (LLMs) for text-generating generative AI applications? Despite the abundance of options for serving LLMs, this is a hard question to answer due to the size of the models, varying model architectures, performance requirements of applications, and more. The Amazon SageMaker Large Model Inference […]  ( 13 min )
    Index your web crawled content using the new Web Crawler for Amazon Kendra
    In this post, we show how to index information stored in websites and use the intelligent search in Amazon Kendra to search for answers from content stored in internal and external websites. In addition, the ML-powered intelligent search can accurately get answers for your questions from unstructured documents with natural language narrative content, for which keyword search is not very effective.  ( 7 min )
  • Open

    Python code for means
    The last couple of articles have looked at various kinds of mean. The Python code for four of these means is trivial: gm = lambda a, b: (a*b)**0.5 am = lambda a, b: (a + b)/2 hm = lambda a, b: 2*a*b/(a+b) chm = lambda a, b: (a**2 + b**2)/(a + b) But the arithmetic-geometric mean […] Python code for means first appeared on John D. Cook.  ( 5 min )
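    The four one-liners in this excerpt are flattened onto a single line by the feed; laid out as a runnable snippet they look like this. The expansions of the names in the comments (geometric, arithmetic, harmonic, contraharmonic) are inferred from the formulas, and the example inputs are chosen here for illustration.

```python
gm = lambda a, b: (a*b)**0.5               # geometric mean
am = lambda a, b: (a + b)/2                # arithmetic mean
hm = lambda a, b: 2*a*b/(a + b)            # harmonic mean
chm = lambda a, b: (a**2 + b**2)/(a + b)   # contraharmonic mean

a, b = 1, 4
print(hm(a, b), gm(a, b), am(a, b), chm(a, b))
```

    For positive inputs the four values come out in the order harmonic ≤ geometric ≤ arithmetic ≤ contraharmonic, with equality only when a equals b.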
  • Open

    Research Focus: Week of October 9, 2023
    Research Focus: Principal researcher Lester Mackey recognized for pioneering statistical and ML techniques; Pareto frontiers in neural feature learning; structural inequality in the influencer industry; new research on cardinality estimation. The post Research Focus: Week of October 9, 2023 appeared first on Microsoft Research.  ( 9 min )
  • Open

    Take the Wheel: NVIDIA NeMo SteerLM Lets Companies Customize a Model’s Responses During Inference
    Developers have a new AI-powered steering wheel to help them hug the road while they drive powerful large language models (LLMs) to their desired locations. NVIDIA NeMo SteerLM lets companies define knobs to dial in a model’s responses as it’s running in production, a process called inference. Unlike current methods for customizing an LLM, it Read article >  ( 6 min )

  • Open

    Implementing Gradient Descent in PyTorch
    The gradient descent algorithm is one of the most popular techniques for training deep neural networks. It has many applications in fields such as computer vision, speech recognition, and natural language processing. While the idea of gradient descent has been around for decades, it’s only recently that it’s been applied to applications related to deep […] The post Implementing Gradient Descent in PyTorch appeared first on MachineLearningMastery.com.  ( 25 min )

  • Open

    Training a Linear Regression Model in PyTorch
    Linear regression is a simple yet powerful technique for predicting the values of variables based on other variables. It is often used for modeling relationships between two or more continuous variables, such as the relationship between income and age, or the relationship between weight and height. Likewise, linear regression can be used to predict continuous […] The post Training a Linear Regression Model in PyTorch appeared first on MachineLearningMastery.com.  ( 24 min )
    Making Linear Predictions in PyTorch
    Linear regression is a statistical technique for estimating the relationship between two variables. A simple example of linear regression is to predict the height of someone based on the square root of the person’s weight (that’s what BMI is based on). To do this, we need to find the slope and intercept of the line. […] The post Making Linear Predictions in PyTorch appeared first on MachineLearningMastery.com.  ( 21 min )

  • Open

    Loading and Providing Datasets in PyTorch
    Structuring the data pipeline so that it can be effortlessly linked to your deep learning model is an important aspect of any deep learning-based system. PyTorch packs everything to do just that. While we used simple datasets in the previous tutorial, we’ll need to work with larger datasets in real-world scenarios in […] The post Loading and Providing Datasets in PyTorch appeared first on MachineLearningMastery.com.  ( 20 min )

  • Open

    Using Dataset Classes in PyTorch
    In machine learning and deep learning problems, a lot of effort goes into preparing the data. Data is usually messy and needs to be preprocessed before it can be used for training a model. If the data is not prepared correctly, the model won’t be able to generalize well. Some of the common steps required […] The post Using Dataset Classes in PyTorch appeared first on MachineLearningMastery.com.  ( 21 min )

  • Open

    Calculating Derivatives in PyTorch
    Derivatives are one of the most fundamental concepts in calculus. They describe how changes in the variable inputs affect the function outputs. The objective of this article is to provide a high-level introduction to calculating derivatives in PyTorch for those who are new to the framework. PyTorch offers a convenient way to calculate derivatives for […] The post Calculating Derivatives in PyTorch appeared first on Machine Learning Mastery.  ( 20 min )

  • Open

    Two-Dimensional Tensors in Pytorch
    Two-dimensional tensors are analogous to two-dimensional matrices. Like a two-dimensional matrix, a two-dimensional tensor has a number of rows and columns. Let’s take a gray-scale image as an example, which is a two-dimensional matrix of numeric values, commonly known as pixels. Ranging from ‘0’ to ‘255’, each number represents a pixel intensity value. Here, […] The post Two-Dimensional Tensors in Pytorch appeared first on Machine Learning Mastery.  ( 21 min )

  • Open

    One-Dimensional Tensors in Pytorch
    PyTorch is an open-source deep learning framework based on the Python language. It allows you to build, train, and deploy deep learning models, offering a lot of versatility and efficiency. PyTorch is primarily focused on tensor operations, where a tensor can be a number, a matrix, or a multi-dimensional array. In this tutorial, we will perform some […] The post One-Dimensional Tensors in Pytorch appeared first on Machine Learning Mastery.  ( 22 min )

  • Open

    365 Data Science courses free until November 21
    Sponsored Post: The unlimited access initiative presents a risk-free way to break into data science. The online educational platform 365 Data Science launches the #21DaysFREE campaign and provides 100% free unlimited access to all content for three weeks. From November 1 to 21, you can take courses from renowned instructors and earn […] The post 365 Data Science courses free until November 21 appeared first on Machine Learning Mastery.  ( 15 min )

  • Open

    Attend the Data Science Symposium 2022, November 8 in Cincinnati
    Sponsored Post: The Center for Business Analytics at the University of Cincinnati will present its annual Data Science Symposium 2022 on November 8. This all-day, in-person event will have three featured speakers and two tech talk tracks with four concurrent presentations in each track. The […] The post Attend the Data Science Symposium 2022, November 8 in Cincinnati appeared first on Machine Learning Mastery.  ( 10 min )

  • Open

    My family's unlikely homeschooling journey
    My husband Jeremy and I never intended to homeschool, and yet we have now, unexpectedly, committed to homeschooling long-term. Prior to the pandemic, we both worked full-time in careers that we loved and found meaningful, and we sent our daughter to a full-day Montessori school. Although I struggled with significant health issues, I felt unbelievably lucky and fulfilled in both my family life and my professional life. The pandemic upended my careful balance. Every family is different, with different needs, circumstances, and constraints, and what works for one may not work for others. My intention here is primarily to share the journey of my own (very privileged) family. Our unplanned introduction to homeschooling: For the first year of the pandemic, most schools in California, where …  ( 7 min )

  • Open

    The Jupyter+git problem is now solved
    Jupyter notebooks don’t work with git by default. With nbdev2, the Jupyter+git problem has been totally solved. It provides a set of hooks that give clean git diffs, solve most git conflicts automatically, and ensure that any remaining conflicts can be resolved entirely within the standard Jupyter notebook environment. To get started, follow the directions on Git-friendly Jupyter. The Jupyter+git problem: Jupyter notebooks are a powerful tool for scientists, engineers, technical writers, students, teachers, and more. They provide an ideal notebook environment for interact…  ( 7 min )
2023-11-10T00:44:12.774Z osmosfeed 1.15.1